Abstract



The thesis presents an attempt at using the syntactic structure in natural language 
for improved language models for speech recognition. The structured language model 
merges techniques in automatic parsing and language modeling using an original 
probabilistic parameterization of a shift-reduce parser. A maximum likelihood rees- 
timation procedure belonging to the class of expectation-maximization algorithms is 
employed for training the model. Experiments on the Wall Street Journal, Switch- 
board and Broadcast News corpora show improvement in both perplexity and word 
error rate — word lattice rescoring — over the standard 3-gram language model. 

The significance of the thesis lies in presenting an original approach to language 
modeling that uses the hierarchical — syntactic — structure in natural language to 
improve on current 3-gram modeling techniques for large vocabulary speech recogni- 
tion. 
Introduction



In the accepted statistical formulation of the speech recognition problem [17] the
recognizer seeks to find the word string

$$\hat{W} = \arg\max_{W} P(A|W)\, P(W)$$

where $A$ denotes the observable speech signal, $P(A|W)$ is the probability that when
the word string $W$ is spoken, the signal $A$ results, and $P(W)$ is the a priori probability
that the speaker will utter $W$.

The language model estimates the values $P(W)$. With $W = w_1, w_2, \ldots, w_n$ we
get by Bayes' theorem,

$$P(W) = \prod_{i=1}^{n} P(w_i | w_1, w_2, \ldots, w_{i-1}) \qquad (0.1)$$

Since the parameter space of $P(w_k | w_1, w_2, \ldots, w_{k-1})$ is too large (the words
$w_j$ belong to a vocabulary $V$ whose size is in the tens of thousands), the language
model is forced to put the history $W_{k-1} = w_1, w_2, \ldots, w_{k-1}$ into an equivalence
class determined by a function $\Phi(W_{k-1})$. As a result,

$$P(W) \cong \prod_{k=1}^{n} P(w_k | \Phi(W_{k-1})) \qquad (0.2)$$

Research in language modeling consists of finding appropriate equivalence classifiers
$\Phi$ and methods to estimate $P(w_k | \Phi(W_{k-1}))$.

The language model of state-of-the-art speech recognizers uses $(n-1)$-gram
equivalence classification, that is, defines

$$\Phi(W_{k-1}) = w_{k-n+1}, w_{k-n+2}, \ldots, w_{k-1}$$

Once the form $\Phi(W_{k-1})$ is specified, only the problem of estimating
$P(w_k | \Phi(W_{k-1}))$ from training data remains.

In most cases, $n = 3$, which leads to a trigram language model. The latter has
been shown to be surprisingly powerful and, essentially, all attempts to improve on
it in the last 20 years have failed. The one interesting enhancement, facilitated by
maximum entropy estimation methodology, has been the use of triggers [27] or of
singular value decomposition [4] (either of which dynamically identifies the topic of
discourse) in combination with n-gram models.

Measures of Language Model Quality 

Word Error Rate. One possibility to measure the quality of a language model is
to evaluate it as part of a speech recognizer. The measure of success is the word error
rate; to calculate it we need to first find the most favorable word alignment between
the hypothesis $\hat{W}$ put out by the recognizer and the true sequence of words $W$
uttered by the speaker (assumed to be known a priori for evaluation purposes only)
and then count the number of incorrect words in $\hat{W}$ per total number of words in $W$.

TRANSCRIPTION: UP UPSTATE NEW YORK SOMEWHERE UH OVER OVER     HUGE AREAS
HYPOTHESIS:       UPSTATE NEW YORK SOMEWHERE UH ALL  ALL  THE HUGE AREAS
ERRORS:        1  0       0   0    0         0  1    1    1   0    0

4 errors per 10 words in transcription; WER = 40%
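
The alignment-and-count procedure above can be sketched in a few lines. This is a minimal illustration, assuming that the "most favorable word alignment" is the one found by a standard minimum edit distance with unit costs for substitutions, deletions and insertions; the function name `word_error_rate` is ours and not part of the thesis.

```python
def word_error_rate(reference, hypothesis):
    """Return (errors, WER) using a minimum-edit-distance word alignment."""
    ref, hyp = reference.split(), hypothesis.split()
    # dist[i][j] = minimum number of edits turning ref[:i] into hyp[:j]
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        dist[i][0] = i  # delete all i reference words
    for j in range(1, len(hyp) + 1):
        dist[0][j] = j  # insert all j hypothesis words
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = dist[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            deletion = dist[i - 1][j] + 1
            insertion = dist[i][j - 1] + 1
            dist[i][j] = min(substitution, deletion, insertion)
    errors = dist[len(ref)][len(hyp)]
    # Errors are normalized by the number of words in the transcription.
    return errors, errors / len(ref)


transcription = "UP UPSTATE NEW YORK SOMEWHERE UH OVER OVER HUGE AREAS"
hypothesis = "UPSTATE NEW YORK SOMEWHERE UH ALL ALL THE HUGE AREAS"
errors, wer = word_error_rate(transcription, hypothesis)
print(f"{errors} errors, WER = {wer:.0%}")
```

On the example pair this counts one deletion (UP), two substitutions (OVER/ALL twice) and one insertion (THE), matching the 4 errors per 10 transcription words quoted above.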

Perplexity. As an alternative to the computationally expensive word error rate
(WER), a statistical language model is evaluated by how well it predicts a string of
symbols $W_t$, commonly referred to as test data, generated by the source to be
modeled.

Assume we compare two models $M_1$ and $M_2$; they assign probability $P_{M_1}(W_t)$
and $P_{M_2}(W_t)$, respectively, to the sample test string $W_t$. The test string has neither
been used nor seen at the estimation step of either model and it was generated by
the same source that we are trying to model. "Naturally", we consider $M_1$ to be a
better model than $M_2$ if $P_{M_1}(W_t) > P_{M_2}(W_t)$.
A commonly used quality measure for a given model $M$ is related to the entropy
of the underlying source and was introduced under the name of perplexity (PPL) [17]:

$$PPL(M) = \exp\left(-\frac{1}{N} \sum_{k=1}^{N} \ln\left[P_M(w_k | W_{k-1})\right]\right) \qquad (0.3)$$

Thesis Layout 

The thesis is organized as follows: 

After a brief introduction to language modeling for speech recognition, Chap- 
ter 2 gives a basic description of the structured language model (SLM) followed by 
Chapters 3.1 and 3 explaining the model parameters reestimation algorithm we used. 
Chapter 4 presents a series of experiments we have carried out on the UPenn Treebank 
corpus ([21]). 

Chapters 5 and 6 describe the setup and speech recognition experiments using 
the structured language model on different corpora: Wall Street Journal (WSJ, [24]), 
Switchboard (SWB, [15]) and Broadcast News (BN). 

We conclude with Chapter 7, outlining the relationship between our approach to 
language modeling — and parsing — and others in the literature and pointing out 
what we believe to be worthwhile future directions of research. 

A few appendices detail mathematical aspects of the reestimation technique we 
have used. 
Chapter 1

Language Modeling for Speech 
Recognition 

The task of a speech recognizer is to automatically transcribe speech into text.
Given a string of acoustic features $A$ extracted by its signal processing front-end from
the raw acoustic waveform, the speech recognizer tries to identify the word sequence
$W$ that produced $A$, typically one sentence at a time. Let $\hat{W}$ be the word string
(hypothesis) output by the speech recognizer. The measure of success is the word
error rate; to calculate it we need to first find the most favorable word alignment
between $\hat{W}$ and $W$ (assumed to be known a priori for evaluation purposes only)
and then count the number of incorrect words in the hypothesized sequence $\hat{W}$ per
total number of words in $W$.

TRANSCRIPTION: UP UPSTATE NEW YORK SOMEWHERE UH OVER OVER     HUGE AREAS
HYPOTHESIS:       UPSTATE NEW YORK SOMEWHERE UH ALL  ALL  THE HUGE AREAS
ERRORS:        1  0       0   0    0         0  1    1    1   0    0

4 errors per 10 words in transcription; WER = 40%

The most successful approach to speech recognition so far is a statistical one
pioneered by Jelinek and his colleagues [2]; speech recognition is viewed as a Bayes
decision problem: given the observed string of acoustic features $A$, find the most
likely word string $\hat{W}$ among those that could have generated $A$:

$$\hat{W} = \arg\max_{W} P(W|A) = \arg\max_{W} P(A|W) \cdot P(W) \qquad (1.1)$$
There are three broad subproblems to be solved:

• decide on a feature extraction algorithm and model the channel probability
$P(A|W)$, commonly referred to as acoustic modeling;

• model the source probability $P(W)$, commonly referred to as language modeling;

• search over all possible word strings $W$ that could have given rise to $A$ and find
the most likely one, $\hat{W}$; due to the large vocabulary size (tens of thousands
of words) an exhaustive search is intractable.

The remaining part of the chapter is organized as follows: we will first describe 
language modeling in more detail by taking a source modeling view; then we will 
describe current approaches to the problem, outlining their advantages and short- 
comings. 

1.1 Basic Language Modeling 

As explained in the introductory section, the language modeling problem is to
estimate the source probability $P(W)$ where $W = w_1, w_2, \ldots, w_n$ is a sequence of
words.

This probability is estimated from a training corpus (thousands of words of text)
according to a modeling assumption on the source that generated the text. Usually
the source model is parameterized according to a set of parameters: $P_\theta(W), \theta \in \Theta$,
where $\Theta$ is referred to as the parameter space.

One first choice faced by the modeler is the alphabet $V$ (also called vocabulary)
in which the $w_i$ symbols take value. For practical purposes one has to limit the
size of the vocabulary. A common choice is to use a finite set of words $V$ and map
any word not in this set to the distinguished type <unknown>.

A second, and much more important, choice is the source model to be used. A
desirable way of making this choice takes into account:

• a priori knowledge of how the source might work, if available;

• possibility to reliably estimate source model parameters; reliability of estimates
limits the number and type of parameters one can estimate given a certain
amount of training data;

• preferably, due to the sequential nature of an efficient search algorithm, the
model should operate left-to-right, allowing the computation of
$P(w_1, w_2, \ldots, w_n) = P(w_1) \cdot \prod_{i=2}^{n} P(w_i | w_1 \ldots w_{i-1})$.

We thus seek to develop parametric conditional models:

$$P_\theta(w_i | w_1 \ldots w_{i-1}), \quad \theta \in \Theta, \; w_i \in V \qquad (1.2)$$

The currently most successful model assumes a Markov source of a given order $n$
leading to the n-gram language model:

$$P_\theta(w_i | w_1 \ldots w_{i-1}) = P_\theta(w_i | w_{i-n+1} \ldots w_{i-1}) \qquad (1.3)$$

1.1.1 Language Model Quality

Any parameter estimation algorithm needs an objective function with respect
to which the parameters are optimized. As stated in the introductory section, the
ultimate goal of a speech recognizer is low word error rate (WER). However, all
attempts to derive an algorithm that would directly estimate the model parameters
so as to minimize WER have failed. As an alternative, a statistical model is evaluated
by how well it predicts a string of symbols $W_t$, commonly referred to as test data,
generated by the source to be modeled.

1.1.2 Perplexity

Assume we compare two models $M_1$ and $M_2$; they assign probability $P_{M_1}(W_t)$
and $P_{M_2}(W_t)$, respectively, to the sample test string $W_t$. The test string has neither
been used nor seen at the estimation step of either model and it was generated by
the same source that we are trying to model. "Naturally", we consider $M_1$ to be a
better model than $M_2$ if $P_{M_1}(W_t) > P_{M_2}(W_t)$. It is worth mentioning that this is
different from maximum likelihood estimation: the test data is not seen during the
model estimation process and thus we cannot directly estimate the parameters of the
model such that it assigns maximum probability to the test string.

A commonly used quality measure for a given model $M$ is related to the entropy
of the underlying source and was introduced under the name of perplexity (PPL) [17]:

$$PPL(M) = \exp\left(-\frac{1}{N} \sum_{i=1}^{N} \ln\left[P_M(w_i | w_1 \ldots w_{i-1})\right]\right) \qquad (1.4)$$

It is easily seen that if our model estimates the source probability exactly:

$$P_M(w_i | w_1 \ldots w_{i-1}) = P_{source}(w_i | w_1 \ldots w_{i-1}), \quad i = 1 \ldots N$$

then (1.4) is a consistent estimate of the exponentiated source entropy $\exp(H_{source})$.
To get an intuitive understanding for PPL (1.4) we can state that it measures the
average surprise of model $M$ when it predicts the next word $w_i$ in the current context
$w_1 \ldots w_{i-1}$.
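
As a small numerical illustration of (1.4), the sketch below computes perplexity from a list of per-word conditional probabilities. It assumes those probabilities have already been produced by some model; the function name `perplexity` and the toy inputs are ours.

```python
import math


def perplexity(cond_probs):
    """exp(-(1/N) * sum_i ln P_M(w_i | w_1 ... w_{i-1})) for N test words."""
    if any(p <= 0.0 for p in cond_probs):
        # One zero-probability word makes the whole test string impossible.
        return math.inf
    n = len(cond_probs)
    return math.exp(-sum(math.log(p) for p in cond_probs) / n)


# A model that always spreads its mass uniformly over 8 words is, on average,
# as "surprised" as if it chose among 8 equally likely words.
print(perplexity([1.0 / 8] * 5))
# A sharper model that puts more mass on the words that occur scores lower.
print(perplexity([0.5, 0.25, 0.5, 0.25]))
```

The uniform case makes the "average surprise" reading concrete: the perplexity equals the number of equally likely choices the model behaves as if it were facing.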

Smoothing 

One important remark is worthwhile at this point: assume that our model $M$ is
faced with the prediction $w_i | w_1 \ldots w_{i-1}$ and that $w_i$ has not been seen in the training
corpus in context $w_1 \ldots w_{i-1}$, which itself possibly has not been encountered in the
training corpus. If $P_M(w_i | w_1 \ldots w_{i-1}) = 0$ then $P_M(w_1 \ldots w_N) = 0$, thus forcing a
recognition error; good models $M$ are smooth, in the sense that

$$\exists \epsilon(M) > 0 \ \text{s.t.}\ P_M(w_i | w_1 \ldots w_{i-1}) > \epsilon, \quad \forall w_i \in V, \ (w_1 \ldots w_{i-1}) \in V^{i-1}.$$

1.2 Current Approaches 

In the previous section we introduced the class of n-gram models. They assume
a Markov source of order $n$, thus making the following equivalence classification of a
given context:

$$[w_1 \ldots w_{i-1}] = w_{i-n+1} \ldots w_{i-1} = h_n \qquad (1.5)$$

An equivalence classification of some similar sort is needed because of the
impossibility to get reliable relative frequency estimates for the full context prediction
$w_i | w_1 \ldots w_{i-1}$. Indeed, as shown in [27], for a 3-gram model the coverage for the
$(w_i | w_{i-2}, w_{i-1})$ events is far from sufficient: the rate of new (unseen) trigrams in test
data relative to those observed in a training corpus of size 38 million words is 21% for
a 5,000-word vocabulary and 32% for a 20,000-word vocabulary. Moreover, approximately
70% of the trigrams in the training data have been seen once, thus making a relative
frequency estimate unusable because of its unreliability.

One standard approach that also ensures smoothing is the deleted interpolation
method [18]. It interpolates linearly among contexts of different order:

$$P_\theta(w_i | w_{i-n+1} \ldots w_{i-1}) = \sum_{k=0}^{n} \lambda_k \cdot f(w_i | h_k) \qquad (1.6)$$

where:

• $h_k = w_{i-k+1} \ldots w_{i-1}$ is the context of order $k$ when predicting $w_i$;

• $f(w_i | h_k)$ is the relative frequency estimate for the conditional probability $P(w_i | h_k)$:

$$f(w_i | h_k) = C(w_i, h_k) / C(h_k), \quad C(h_k) = \sum_{w_i \in V} C(w_i, h_k), \quad k = 1 \ldots n,$$

$$f(w_i | h_0) = 1/|V|, \quad \forall w_i \in V, \ \text{uniform};$$

• $\lambda_k, k = 0 \ldots n$ are the interpolation coefficients satisfying $\lambda_k > 0, k = 0 \ldots n$
and $\sum_{k=0}^{n} \lambda_k = 1$.

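The model parameters $\theta$ are:

• the counts $C(h_n, w_i)$; lower order counts are inferred recursively by summing out
the earliest context word: $C(h_k, w_i) = \sum_{w_{i-k} \in V} C(w_{i-k}, h_k, w_i)$;

• the interpolation coefficients $\lambda_k, k = 0 \ldots n$.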

A simple way to estimate the model parameters involves a two stage process:
1. gather counts from development data (about 90% of training data);
2. estimate interpolation coefficients to minimize the perplexity of cross-validation
data (the remaining 10% of the training data) using the expectation-maximization
(EM) algorithm [14].
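
The two-stage process can be sketched as follows. This is a toy illustration under stated assumptions, not the estimation code used in the thesis: the model order is fixed at a bigram, so the interpolated components are the uniform, unigram and bigram relative frequencies; the corpora are a few made-up words; and the names `make_components`, `em_interpolation_weights` and `log_likelihood` are ours.

```python
import math
from collections import Counter


def make_components(train, vocab):
    """Stage 1: gather counts; return a function giving the component estimates."""
    unigrams = Counter(train)                  # C(w)
    bigrams = Counter(zip(train, train[1:]))   # C(prev, w)
    contexts = Counter(train[:-1])             # C(prev) = sum over w of C(prev, w)
    total = len(train)

    def components(prev, word):
        uniform = 1.0 / len(vocab)             # order 0: uniform over V
        unigram = unigrams[word] / total       # order 1: empty context
        bigram = bigrams[(prev, word)] / contexts[prev] if contexts[prev] else 0.0
        return [uniform, unigram, bigram]

    return components


def log_likelihood(events, lambdas):
    """Log-probability of the held-out events under the interpolated model."""
    return sum(math.log(sum(l * f for l, f in zip(lambdas, fs))) for fs in events)


def em_interpolation_weights(events, iterations=20):
    """Stage 2: EM re-estimation of the interpolation coefficients."""
    k = len(events[0])
    lambdas = [1.0 / k] * k
    for _ in range(iterations):
        expected = [0.0] * k
        for fs in events:
            mix = sum(l * f for l, f in zip(lambdas, fs))
            for j in range(k):
                # Posterior probability that component j generated this word.
                expected[j] += lambdas[j] * fs[j] / mix
        lambdas = [e / len(events) for e in expected]
    return lambdas


train = "the cat sat on the mat the cat ate the fish".split()
heldout = "the cat ate the mat on the fish".split()
vocab = set(train) | {"<unknown>"}
components = make_components(train, vocab)
events = [components(prev, word) for prev, word in zip(heldout, heldout[1:])]
lambdas = em_interpolation_weights(events)
print([round(l, 3) for l in lambdas])
```

Because the uniform component is never zero, every held-out word receives non-zero probability, which is the smoothing property discussed earlier; each EM iteration cannot decrease the held-out likelihood, so it cannot increase the held-out perplexity.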

Other approaches use different smoothing techniques — maximum entropy [5], 
back-off [20] — but they all share the same Markov assumption on the underlying 
source. 

An attempt to overcome this limitation is developed in [27]. Words in the context
outside the range of the 3-gram model are identified as "triggers" and retained
together with the "target" word in the predicted position. The (trigger, target) pairs
are treated as complementary sources of information and integrated with the n-gram
predictors using the maximum entropy method. The method has proven successful,
but computationally burdensome.

Our attempt will make use of the hierarchical structuring of word strings in natural 
language for expanding the memory length of the source. 
Chapter 2

A Structured Language Model 

It has long been argued in the linguistics community that the simple-minded
Markov assumption is far from accurate for modeling the natural language source.
However, so far very few approaches have managed to outperform the n-gram model in
perplexity or word error rate, none of them exploiting syntactic structure for better
modeling of the natural language source.

The model we present is closely related to the one investigated in [7], however 
different in a few important aspects: 

• our model operates in a left-to-right manner, thus allowing its use directly in 
the hypothesis search for W in (1.1); 

• our model is a factored version of the one in [7], thus enabling the calculation 
of the joint probability of words and parse structure; this was not possible in 
the previous case due to the huge computational complexity of that model; 

• our model assigns probability at the word level, being a proper language model. 

2.1 Syntactic Structure in Natural Language 

Although not complete, there is a certain agreement in the linguistics community
as to what constitutes syntactic structure in natural language. In an effort to provide
the computational linguistics community with a database that reflects the current
basic level of agreement, a treebank was developed at the University of Pennsylvania,
known as the UPenn Treebank [21]. The treebank contains sentences which were
manually annotated with syntactic structure. A sample parse tree from the treebank
is shown in Figure 2.1. Each word bears a part of speech tag (POS tag): e.g.
Pierre is annotated as being a proper noun (NNP). Round brackets are used to mark
constituents, each constituent being tagged with a non-terminal label (NT label):
e.g. (NP (NNP Pierre) (NNP Vinken) ) is marked as noun phrase (NP). Some
non-terminal labels are enriched with additional information which is usually discarded
as a first approximation: e.g. NP-TMP becomes NP. The task of recovering the parsing
structure with POS/NT annotation for a given word sequence (sentence) is referred
to as automatic parsing of natural language (or simply parsing). A sub-task whose
aim is to recover the part of speech tags for a given word sequence is referred to as
POS-tagging.

( (S
    (NP-SBJ
      (NP (NNP Pierre) (NNP Vinken) )
      (, ,)
      (ADJP
        (NP (CD 61) (NNS years) )
        (JJ old) )
      (, ,) )
    (VP (MD will)
      (VP (VB join)
        (NP (DT the) (NN board) )
        (PP-CLR (IN as)
          (NP (DT a) (JJ nonexecutive) (NN director) ))
        (NP-TMP (NNP Nov.) (CD 29) )))
    (. .) ))

Figure 2.1: UPenn Treebank Parse Tree Representation
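
The bracketed notation is easy to read programmatically. The sketch below is a minimal reader, assuming every bracket carries a label (so the unlabeled outermost wrapper of a treebank file entry must be removed first) and that stripping everything after the first hyphen is an acceptable way to discard the extra label information; the names `parse_bracketed` and `words` are ours.

```python
def parse_bracketed(text, strip_function_tags=True):
    """Read a labeled bracketing into nested (label, children) tuples.

    A POS-tagged word becomes (tag, [word]).
    """
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def node():
        nonlocal pos
        assert tokens[pos] == "(", "expected an opening bracket"
        label = tokens[pos + 1]
        pos += 2
        if strip_function_tags and "-" in label:
            label = label.split("-")[0]      # e.g. NP-TMP becomes NP
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())      # nested constituent
            else:
                children.append(tokens[pos]) # the word under a POS tag
                pos += 1
        pos += 1                             # consume the closing bracket
        return (label, children)

    return node()


def words(tree):
    """Return the word sequence spanned by a parsed tree, left to right."""
    _, children = tree
    out = []
    for child in children:
        out.extend(words(child) if isinstance(child, tuple) else [child])
    return out


tree = parse_bracketed("(VP (VB join) (NP (DT the) (NN board)) (NP-TMP (NNP Nov.) (CD 29)))")
print(tree)
print(words(tree))
```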

This effort fostered research in automatic part-of-speech tagging and parsing of 
natural language, providing a base for developing and testing algorithms that try to 
describe computationally the constraints in natural language. 
State of the art parsing and POS-tagging technology developed in the computational
linguistics community operates at the sentence level. Statistical approaches
employ conditional probabilistic models $P(T|W)$ where $W$ denotes the sentence to
be parsed and $T$ is the hidden parse structure or POS tag sequence. Due to the
left-to-right constraint imposed by the speech recognizer on the language model
operation, we will be forced to develop syntactic structure for sentence prefixes. This
is just one of the limitations imposed by the fact that we aim at incorporating the
language model in a speech recognizer. Information that is present in written text
but silent in speech, such as case information (Pierre vs. pierre) and punctuation,
will not be used by our model either.

The use of headwords has become standard in the computational linguistics com- 
munity: the headword of a phrase is the word that best represents the phrase, all 
the other words in the phrase being modifiers of the headword. For example we refer 
to years as the headword of the phrase (NP (CD 61) (NNS years) ). The lexical- 
ization — headword percolation — of the treebank has proven extremely useful in 
increasing the accuracy of automatic parsers. 

There are ongoing arguments about the adequacy of the tree representation for 
syntactic dependencies in natural language. One argument debates the usage of 
binary branching — in which one word modifies exactly one other word in the same 
sentence — versus trees with unconstrained branching. Learnability issues favor the 
former, as argued in [16] . It is not surprising that the binary structure also lends itself 
to a simpler algorithmic description and is the choice for our modeling approach. 

As an example, the output of the headword percolation and binarization procedure
for the parse tree in Figure 2.1 is presented in Figure 2.2. The headwords are now
percolated at each intermediate node in the tree; the additional bit (value 0 or 1)
indicates the origin of the headword in each constituent.

2.1.1 Headword Percolation and Binarization 

In order to obtain training data for our model we need to binarize the UPenn
Treebank [21] parse trees and percolate headwords. The procedure we used was to first
percolate headwords using a context-free (CF) rule-based approach and then binarize
the parses by again using a rule-based approach.

Headword Percolation 

Headword percolation is inherently a heuristic process; we were satisfied with the
output of an enhanced version of the procedure described in [11], also known under
the name "Magerman & Black Headword Percolation Rules".

The procedure first decomposes a parse tree from the treebank into its context- 
free constituents, identified solely by the non-terminal/POS labels. Within each con- 
stituent we then identify the headword position and then, in a recursive third step, 
we fill in the headword position with the actual word percolated up from the leaves 
of the tree. 

The headword percolation procedure is based on rules for identifying the headword 
position within each constituent. They are presented in table 2.1. 

Let $Z \rightarrow Y_1 \ldots Y_n$ be one of the context-free (CF) rules that make up a given
parse. We identify the headword position as follows:

• identify in the first column of the table the entry that corresponds to the $Z$
non-terminal label;

• search $Y_1 \ldots Y_n$ from either left or right, as indicated in the second column of the
entry, for the $Y_i$ label that matches the regular expressions listed in the entry;
the first matching $Y_i$ is going to be the headword of the $(Z\ (Y_1 \ldots) \ldots (Y_n \ldots))$
constituent; the regular expressions listed in one entry are ranked in left to right
order; first we try to match the first one, if unsuccessful we try the second one
and so on.
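
The lookup just described can be sketched as below. The data layout is an assumption of ours: each entry holds a search direction and a ranked list of patterns, where ("any", S) matches a label in the set S and ("none", S) matches a label outside S. Only the NP and VP entries are transcribed, the punctuation set is deliberately shortened, and child labels use the _ (POS tag) and ~ (non-terminal) prefixes of the rule table. The function returns a 0-based position.

```python
# Shortened punctuation list, for illustration only (an assumption).
PUNCT = {"_.", "_,", "_:", "_LRB", "_RRB"}

# label -> (search direction, ranked patterns)
RULES = {
    "NP": ("right", [("any", {"_NNP", "_NNPS", "~NP", "_NN", "_NNS",
                              "~NX", "_CD", "~QP", "_PRP", "_VBG"}),
                     ("none", PUNCT)]),
    "VP": ("left", [("any", {"_VBD", "_VBN", "_MD", "_VBZ", "_VB",
                             "~VP", "_VBG", "_VBP"}),
                    ("none", PUNCT)]),
}


def head_position(parent, children):
    """Return the 0-based index of the head child for parent -> children."""
    direction, patterns = RULES[parent]
    if direction == "left":
        order = range(len(children))                # scan Y_1 ... Y_n
    else:
        order = range(len(children) - 1, -1, -1)    # scan Y_n ... Y_1
    for kind, labels in patterns:                   # patterns are tried in rank order
        for i in order:
            in_list = children[i] in labels
            if in_list == (kind == "any"):
                return i
    return None                                     # no pattern matched


# (NP (CD 61) (NNS years)): scanning from the right, NNS matches first.
print(head_position("NP", ["_CD", "_NNS"]))
# (VP (MD will) (VP ...)): scanning from the left, MD matches first.
print(head_position("VP", ["_MD", "~VP"]))
```

For the first call the head is the second child, which agrees with years being the headword of (NP (CD 61) (NNS years) ) as stated earlier.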

A regular expression of the type <_CD|~QP> matches any of the constituents listed
between angular parentheses. For example, the <^_.|_,|_''|_``|_`|_'|_:|_LRB|_RRB>
regular expression will match any constituent that is not (list begins with <^)
among any of the elements in the list between <^ and >, in this case any constituent
which is not a punctuation mark. The terminal labels have _ prepended to them,
as in _CD; the non-terminal labels have the ~ prefix, as in ~QP; | is merely a
separator in the list.

Label    Search  Regular expressions (ranked left to right)
TOP      right   _SE _SB
ADJP     right   <~QP|_JJ|_VBN|~ADJP|_$|_JJR> <^~PP|~S|~SBAR|_.|_,|_''|_``|_`|_'|_:|_LRB|_RRB>
ADVP     right   <_RBR|_RB|_TO|~ADVP> <^~PP|~S|~SBAR|_.|_,|_''|_``|_`|_'|_:|_LRB|_RRB>
CONJP    left    _RB <^_.|_,|_''|_``|_`|_'|_:|_LRB|_RRB>
FRAG     left    <^_.|_,|_''|_``|_`|_'|_:|_LRB|_RRB>
INTJ     left    <^_.|_,|_''|_``|_`|_'|_:|_LRB|_RRB>
LST      left    _LS <^_.|_,|_''|_``|_`|_'|_:|_LRB|_RRB>
NAC      right   <_NNP|_NNPS|~NP|_NN|_NNS|~NX|_CD|~QP|_VBG> <^_.|_,|_''|_``|_`|_'|_:|_LRB|_RRB>
NP       right   <_NNP|_NNPS|~NP|_NN|_NNS|~NX|_CD|~QP|_PRP|_VBG> <^_.|_,|_''|_``|_`|_'|_:|_LRB|_RRB>
NX       right   <_NNP|_NNPS|~NP|_NN|_NNS|~NX|_CD|~QP|_VBG> <^_.|_,|_''|_``|_`|_'|_:|_LRB|_RRB>
PP       left    _IN _TO _VBG _VBN ~PP <^_.|_,|_''|_``|_`|_'|_:|_LRB|_RRB>
PRN      left    ~NP ~PP ~SBAR ~ADVP ~SINV ~S ~VP <^_.|_,|_''|_``|_`|_'|_:|_LRB|_RRB>
PRT      left    _RP <^_.|_,|_''|_``|_`|_'|_:|_LRB|_RRB>
QP       left    <_CD|~QP> <_NNP|_NNPS|~NP|_NN|_NNS|~NX> <_DT|_PDT> <_JJR|_JJ> <^_CC|_.|_,|_''|_``|_`|_'|_:|_LRB|_RRB>
RRC      left    ~ADJP ~PP ~VP <^_.|_,|_''|_``|_`|_'|_:|_LRB|_RRB>
S        right   ~VP <~SBAR|~SBARQ|~S|~SQ|~SINV> <^_.|_,|_''|_``|_`|_'|_:|_LRB|_RRB>
SBAR     right   <~S|~SBAR|~SBARQ|~SQ|~SINV> <^_.|_,|_''|_``|_`|_'|_:|_LRB|_RRB>
SBARQ    right   ~SQ ~S ~SINV ~SBAR <^_.|_,|_''|_``|_`|_'|_:|_LRB|_RRB>
SINV     right   <~VP|_VBD|_VBN|_MD|_VBZ|_VB|_VBG|_VBP> ~S ~SINV <^_.|_,|_''|_``|_`|_'|_:|_LRB|_RRB>
SQ       left    <_VBD|_VBN|_MD|_VBZ|_VB|~VP|_VBG|_VBP> <^_.|_,|_''|_``|_`|_'|_:|_LRB|_RRB>
UCP      left    <^_.|_,|_''|_``|_`|_'|_:|_LRB|_RRB>
VP       left    <_VBD|_VBN|_MD|_VBZ|_VB|~VP|_VBG|_VBP> <^_.|_,|_''|_``|_`|_'|_:|_LRB|_RRB>
WHADJP   right   <^_.|_,|_''|_``|_`|_'|_:|_LRB|_RRB>
WHADVP   right   _WRB <^_.|_,|_''|_``|_`|_'|_:|_LRB|_RRB>
WHNP     right   _WP _WDT _JJ _WP$ ~WHNP <^_.|_,|_''|_``|_`|_'|_:|_LRB|_RRB>
WHPP     left    _IN <^_.|_,|_''|_``|_`|_'|_:|_LRB|_RRB>
X        right   <^_.|_,|_''|_``|_`|_'|_:|_LRB|_RRB>

Table 2.1: Headword Percolation Rules

Binarization

Once the position of the headword within a constituent (equivalent with a CF
production of the type $Z \rightarrow Y_1 \ldots Y_n$, where $Z, Y_1, \ldots Y_n$ are non-terminal labels
or POStags (only for $Y_i$)) is identified to be $k$, we binarize the constituent as
follows: depending on the $Z$ identity, a fixed rule is used to decide which of the two
binarization schemes in Figure 2.3 to apply. The intermediate nodes created by the
above binarization schemes receive the non-terminal label $Z'$.

Figure 2.3: Binarization schemes

The choice among the two schemes is made according to the list of rules presented
in table 2.2, based on the identity of the label on the left-hand side of a CF rewrite
rule.

Notice that whenever $k = 1$ or $k = n$ (a case which is very frequent) the two
schemes presented above yield the same binary structure.

Another problem when binarizing the parse trees is the presence of unary
productions. Our model allows only unary productions of the type $Z \rightarrow Y$ where $Z$ is a
non-terminal label and $Y$ is a POS tag. The unary productions $Z \rightarrow Y$ where both $Z$
and $Y$ are non-terminal labels were deleted from the treebank, only the $Z$ constituent
being retained: (Z (Y (.) (.))) becomes (Z (.) (.)).
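
A possible reading of the two binarization schemes is sketched below. Since the figure showing them is not reproduced here, the tree shapes are an assumption based on the verbal description that accompanies the binarization rules: scheme A attaches right modifiers first (left branching) and then left modifiers (right branching), while scheme B does the reverse. The function name `binarize` and the tuple representation (label, left child, right child) are ours.

```python
def binarize(label, children, head, scheme):
    """Binarize Z -> Y_1 ... Y_n given the 0-based head position.

    Intermediate nodes get the label Z' and the topmost node keeps Z.
    """
    left = children[:head]          # left modifiers  Y_1 ... Y_{k-1}
    right = children[head + 1:]     # right modifiers Y_{k+1} ... Y_n
    attach_left = [("L", m) for m in reversed(left)]   # nearest modifier first
    attach_right = [("R", m) for m in right]           # nearest modifier first
    if scheme == "A":
        steps = attach_right + attach_left
    elif scheme == "B":
        steps = attach_left + attach_right
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    node = children[head]
    for i, (side, modifier) in enumerate(steps):
        node_label = label if i == len(steps) - 1 else label + "'"
        if side == "R":
            node = (node_label, node, modifier)
        else:
            node = (node_label, modifier, node)
    return node


# Head in the middle: the two schemes differ.
print(binarize("Z", ["a", "H", "c"], 1, "A"))
print(binarize("Z", ["a", "H", "c"], 1, "B"))
# Head at the edge (k = n): both schemes give the same structure.
print(binarize("NP", ["the", "old", "board"], 2, "A"))
```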
## first column : constituent label
## second column: binarization type : A or B
## A means right modifiers go first, left branching, then left
##   modifiers are attached via right branching
## B means left modifiers go first, right branching, then right
##   modifiers are attached via left branching
##
TOP      A
ADJP     B
ADVP     B
CONJP    A
FRAG     A
INTJ     A
LST      A
NAC      B
NP       B
NX       B
PP       A
PRN      A
PRT      A
QP       A
RRC      A
S        B
SBAR     B
SBARQ    B
SINV     B
SQ       A
UCP      A
VP       A
WHADJP   B
WHADVP   B
WHNP     B
WHPP     A
X        B

Table 2.2: Binarization Rules
2.2 Exploiting Syntactic Structure for Language Modeling

Consider predicting the word after in the sentence: 
the contract ended with a loss of 7 cents 
after trading as low as 89 cents. 

A 3-gram approach would predict after from (7, cents) whereas it is intuitively 
clear that the strongest predictor would be contract ended which is outside the 
reach of even 7-grams. What would enable us to identify the predictors in the sentence 
prefix? 

The linguistically correct partial parse of the sentence prefix when predicting
after is shown in Figure 2.4. The word ended is called the headword of the
constituent (ended (with (...))) and ended is an exposed headword when predicting
after (topmost headword in the largest constituent that contains it). Our working
hypothesis is that the syntactic structure filters out irrelevant words and points to
the important ones, thus enabling the use of information in the more distant past
when predicting the next word. We will attempt to model this using the concept of
exposed headwords introduced before.

the_DT contract_NN ended_VBD with_IN a_DT loss_NN of_IN 7_CD cents_NNS after

Figure 2.4: Partial parse

We will give two heuristic arguments that justify the use of exposed headwords:

• the 3-gram context for predicting after, (7, cents), is intuitively less
satisfying than using the two most recent exposed headwords (contract, ended)
identified by the parse tree;

• the headword context does not change if we remove the (of (7 cents)) con- 
stituent — the resulting sentence is still a valid one — whereas the 3-gram 
context becomes (a, loss). 

The preliminary experiments reported in [8] — although the perplexity results 
are conditioned on parse structure developed by human annotators by having the 
entire sentence at their disposal — showed the usefulness of headwords accompanied 
by non-terminal labels for making a better prediction of the word following a given 
sentence prefix. 

Our model will attempt to build the syntactic structure incrementally while travers- 
ing the sentence left-to-right. The word string W can be observed whereas the parse 
structure with headword and POS/NT label annotation — denoted by T — remains 
hidden. The model will assign a probability P{W, T) to every sentence W with every 
possible POStag assignment, binary branching parse, non-terminal label and head- 
word annotation for every constituent of T. 

Let $W$ be a sentence of length $n$ words to which we have prepended <s> and
appended </s> so that $w_0 =$ <s> and $w_{n+1} =$ </s>. Let $W_k$ be the word k-prefix
$w_0 \ldots w_k$ of the sentence and $W_k T_k$ the word-parse k-prefix. To stress this point, a
word-parse k-prefix contains, for a given parse, those and only those binary subtrees
whose span is completely included in the word k-prefix, excluding $w_0 =$ <s>.
Single words along with their POStag can be regarded as root-only trees. Figure 2.5
shows a word-parse k-prefix; h_0 .. h_{-m} are the exposed heads, each head being
a pair (headword, non-terminal label), or (word, POStag) in the case of a root-only
tree. A complete parse (Figure 2.6) is defined as a binary parse of the
$(w_1, t_1) \ldots (w_n, t_n)$ (</s>, SE) sequence with the restriction that (</s>, TOP') is
the only allowed head. (SB is a distinguished POStag for the sentence beginning
symbol <s>; SE is a distinguished POStag for the sentence end symbol </s>.) Note
that $((w_1, t_1) \ldots (w_n, t_n))$ needn't be a constituent, but
for the parses where it is, there is no a priori restriction on which of its words is the
headword or what is the non-terminal label that accompanies the headword. This is
one other notable difference between our model and the traditional ones developed in
the computational linguistics community, imposed by the bottom-up operation of the
model. The manually annotated trees in the treebank (see Figure 2.2) have all the
words in a sentence as one single constituent bearing a restricted set of non-terminal
labels: the sentence $(S\ (w_1, t_1) \ldots (w_n, t_n))$ is a constituent labeled with S.

Figure 2.5: A word-parse k-prefix

(<s>, SB) (w_1, t_1) ... (w_n, t_n) (</s>, SE)

Figure 2.6: Complete parse

As can be observed, the UPenn Treebank-style trees are a subset of the family
of trees allowed by our parameterization, making a direct comparison between our
model and state of the art parsing techniques (which insist on generating UPenn
Treebank-style parses) less meaningful.

The model will operate by means of three modules: 

• WORD-PREDICTOR predicts the next word $w_{k+1}$ given the word-parse
k-prefix $W_k T_k$ and then passes control to the TAGGER;

• TAGGER predicts the POStag $t_{k+1}$ of the next word given the word-parse
k-prefix and the newly predicted word $w_{k+1}$ and then passes control to the
PARSER;

• PARSER grows the already existing binary branching structure by repeatedly
generating transitions from the following set:
(unary, NTlabel), (adjoin-left, NTlabel) or (adjoin-right, NTlabel)
until it passes control to the WORD-PREDICTOR by taking a null transition.
NTlabel is the non-terminal label assigned to the newly built constituent and
{left, right} specifies where the new headword is percolated from.

The operations performed by the PARSER are illustrated in Figures 2.7-2.9 and
they ensure that all possible binary branching parses with all possible headword and
non-terminal label assignments for the $w_1 \ldots w_k$ word sequence can be generated.
Algorithm 1 at the end of this chapter formalizes the above description of the
sequential generation of a sentence with a complete parse. The unary transition is allowed
only when the most recent exposed head is a leaf of the tree (a regular word along
with its POStag), hence it can be taken at most once at a given position in the
input word string. The second subtree in Figure 2.5 provides an example of a unary
transition followed by a null transition.

Figure 2.7: Before an adjoin operation (node labels: h_{-2}, h_{-1}, h_0, T_{-m}, <s>)

Figure 2.8: Result of adjoin-left under NTtag (h'_{-1} = h_{-2}, h'_0 = (h_{-1}.word, NTtag); T'_{-m+1} <- <s>)

Figure 2.9: Result of adjoin-right under NTtag (h'_{-1} = h_{-2}, h'_0 = (h_0.word, NTtag); T'_{-m+1} <- <s>)
It is easy to see that any given word sequence with a possible parse and headword
annotation is generated by a unique sequence of model actions. This will prove very 
useful in initializing our model parameters from a treebank. 
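The effect of the PARSER transitions on the exposed heads can be sketched in a few lines of Python. This is only an illustration of Figures 2.7-2.9, not the thesis implementation: the `Head` record, the `is_leaf` flag and the function names are assumptions introduced here, and the list of heads stands in for the full parse structure.

```python
from typing import List, NamedTuple


class Head(NamedTuple):
    word: str      # headword of the constituent
    tag: str       # POStag for a leaf, non-terminal label otherwise
    is_leaf: bool  # True only for a plain (word, POStag) pair


def unary(heads: List[Head], label: str) -> List[Head]:
    """Relabel the most recent exposed head h_0; allowed only on a leaf."""
    h0 = heads[-1]
    if not h0.is_leaf:
        raise ValueError("unary is only allowed when h_0 is a leaf")
    return heads[:-1] + [Head(h0.word, label, False)]


def adjoin(heads: List[Head], direction: str, label: str) -> List[Head]:
    """Merge h_{-1} and h_0 into one constituent labeled `label`.

    adjoin-left percolates the headword from h_{-1},
    adjoin-right percolates it from h_0.
    """
    if direction not in ("left", "right"):
        raise ValueError("direction must be 'left' or 'right'")
    if len(heads) < 2:
        raise ValueError("two exposed heads are needed to adjoin")
    h_minus1, h0 = heads[-2], heads[-1]
    source = h_minus1 if direction == "left" else h0
    return heads[:-2] + [Head(source.word, label, False)]


heads = [Head("<s>", "SB", True), Head("the", "DT", True), Head("contract", "NN", True)]
after_right = adjoin(heads, "right", "NP")  # headword comes from h_0
after_left = adjoin(heads, "left", "NP")    # headword comes from h_{-1}
```

In both cases the old $h_{-2}$ becomes the new $h_{-1}$, as in the figures.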

2.3 Probabilistic Model 

The language model operation provides an encoding of a given word sequence
along with a parse tree $(W, T)$ into a sequence of elementary model actions and it can
be formalized as a finite state machine (FSM); see Figure 2.10. In order to obtain
a correct probability assignment $P(W, T)$ one simply has to assign proper conditional
probabilities on each transition in the FSM that describes the model.



Figure 2.10: Language Model Operation as a Finite State Machine

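The probability $P(W,T)$ of a word sequence $W$ and a complete parse $T$ can be
broken into:

$$P(W,T) = \prod_{k=1}^{n+1} \left[ P(w_k \mid W_{k-1}T_{k-1}) \cdot P(t_k \mid W_{k-1}T_{k-1}, w_k) \cdot P(T_{k-1}^{k} \mid W_{k-1}T_{k-1}, w_k, t_k) \right] \qquad (2.1)$$

$$P(T_{k-1}^{k} \mid W_{k-1}T_{k-1}, w_k, t_k) = \prod_{i=1}^{N_k} P(p_i^k \mid W_{k-1}T_{k-1}, w_k, t_k, p_1^k \ldots p_{i-1}^k) \qquad (2.2)$$

where:

• $W_{k-1}T_{k-1}$ is the word-parse $(k-1)$-prefix

• $w_k$ is the word predicted by WORD-PREDICTOR

• $t_k$ is the tag assigned to $w_k$ by the TAGGER

• $T_{k-1}^{k}$ is the incremental parse structure built by the PARSER at position $k$;
attached to $T_{k-1}$ it generates $T_k = T_{k-1} \,\|\, T_{k-1}^{k}$, where $\|$ stands for
concatenation

• $N_k - 1$ is the number of operations the PARSER executes at position $k$ of
the input string before passing control to the WORD-PREDICTOR (the $N_k$-th
operation at position $k$ is the null transition); $N_k$ is a function of $T$

• $p_i^k$ denotes the $i$-th PARSER operation carried out at position $k$ in the word
string:

$p_i^k \in \{(\text{adjoin-left}, \text{NTtag}), (\text{adjoin-right}, \text{NTtag})\}$, $1 \leq i < N_k$,
$p_i^k = \text{null}$, $i = N_k$

Each $(W_{k-1}T_{k-1}, w_k, t_k, p_1^k \ldots p_{i-1}^k)$ is a valid word-parse $k$-prefix $W_kT_k$ at position
$k$ in the sentence, $i = 1, \ldots, N_k$.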

To ensure a proper probabilistic model over the set of complete parses for any
sentence $W$, certain PARSER and WORD-PREDICTOR probabilities must be given
specific values:

• $P(\text{null} \mid W_kT_k) = 1$, if h_{-1}.word = <s> and h_0 ≠ (</s>, TOP'), that
is, before predicting </s>; this ensures that (<s>, SB) is adjoined in the last step
of the parsing process;

• - $P((\text{adjoin-right}, \text{TOP}) \mid W_kT_k) = 1$,
    if h_0 = (</s>, TOP') and h_{-1}.word = <s>

  - $P((\text{adjoin-right}, \text{TOP'}) \mid W_kT_k) = 1$,
    if h_0 = (</s>, TOP') and h_{-1}.word ≠ <s>

  ensure that the parse generated by our model is consistent with the definition
  of a complete parse;

• $\exists \epsilon > 0, \forall W_{k-1}T_{k-1},\ P(w_k = \text{</s>} \mid W_{k-1}T_{k-1}) \geq \epsilon$ ensures that the model halts
with probability one.

Footnote: Not all the paths through the FSM that describes the language model will
result in a correct binary tree as defined by the complete parse, Figure 2.6. In order
to prohibit such paths, we impose a set of constraints on the probability values of
different model components, consistent with Algorithm 1.

A few comments on Eq. (2.1) are in order at this point. Eq. (2.1) assigns
probability to a directed acyclic graph $(W, T)$. Many other probability assignments
are possible, and probably the most obvious choice would have been the factorization
used in context free grammars. Our choice is dictated by its simplicity and
left-to-right bottom-up operation. This also leads to a proper and very simple word
level probability estimate (see Section 2.6) even when pruning the set of parses $T$.

Our factorization Eq. (2.1) assumes certain dependencies between the nodes in the
graph $(W,T)$. Also, in order to be able to reliably estimate the model components
we need to make appropriate equivalence classifications of the conditioning part for
each component. This is equivalent to making certain conditional independence
assumptions which may not be (and probably are not) correct and thus
have a damaging effect on the modeling power of our model.

The equivalence classification should identify the strong predictors in the context 
and allow reliable estimates from a treebank. Our choice is inspired by [11] and 
intuitively explained in Section 2.2: 

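$$P(w_k \mid W_{k-1}T_{k-1}) = P(w_k \mid [W_{k-1}T_{k-1}]) = P(w_k \mid h_0, h_{-1}) \qquad (2.3)$$
$$P(t_k \mid w_k, W_{k-1}T_{k-1}) = P(t_k \mid w_k, [W_{k-1}T_{k-1}]) = P(t_k \mid w_k, h_0.\text{tag}, h_{-1}.\text{tag}) \qquad (2.4)$$
$$P(p_i^k \mid W_kT_k) = P(p_i^k \mid [W_kT_k]) = P(p_i^k \mid h_0, h_{-1}) \qquad (2.5)$$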

The above equivalence classifications are limited by the severe data sparseness
problem faced by the 3-gram model and by no means do we believe that they are adequate,
especially the one used in the PARSER model (2.5). Richer equivalence classifications
should use a probability estimation method that deals better with sparse data than the
one presented in Section 2.4. The limit in complexity on the WORD-PREDICTOR
(Eq. 2.3) also makes our model directly comparable with a 3-gram model. A few
different equivalence classifications have been tried, as described in Section 4.2.1.

It is worth noting that if the binary branching structure developed by the parser 
were always right-branching and we mapped the POStag and non-terminal tag vo- 
cabularies to a single type, then our model would be equivalent to a trigram language 
model. 

2.4 Modeling Tool 

All model components (WORD-PREDICTOR, TAGGER, PARSER) are conditional
probabilistic models of the type $P(u \mid z_1, z_2, \ldots, z_n)$ where $u, z_1, z_2, \ldots, z_n$
belong to a mixed set of words, POStags, non-terminal tags and parser operations ($u$
only). Let $\mathcal{U}$ be the vocabulary in which the predicted random variable $u$ takes values.

For simplicity, the probability estimation method we chose was recursive linear
interpolation among relative frequency estimates of different orders $f_k(\cdot)$, $k = 0 \ldots n$
using a recursive mixing scheme (see Figure 2.11):

$$P_n(u \mid z_1,\ldots,z_n) = \lambda(z_1,\ldots,z_n) \cdot P_{n-1}(u \mid z_1,\ldots,z_{n-1}) + (1 - \lambda(z_1,\ldots,z_n)) \cdot f_n(u \mid z_1,\ldots,z_n), \qquad (2.6)$$

$$P_{-1}(u) = \text{uniform}(\mathcal{U}) \qquad (2.7)$$

where:

• $z_1,\ldots,z_n$ is the context of order $n$ when predicting $u$;

• $f_k(u \mid z_1,\ldots,z_k)$ is the order-$k$ relative frequency estimate for the conditional
probability $P(u \mid z_1,\ldots,z_k)$:

$$f_k(u \mid z_1,\ldots,z_k) = C(u,z_1,\ldots,z_k)/C(z_1,\ldots,z_k),$$
$$C(u,z_1,\ldots,z_k) = \sum_{z_{k+1} \in \mathcal{Z}_{k+1}} \cdots \sum_{z_n \in \mathcal{Z}_n} C(u,z_1,\ldots,z_k,z_{k+1},\ldots,z_n),$$
$$C(z_1,\ldots,z_k) = \sum_{u \in \mathcal{U}} C(u,z_1,\ldots,z_k);$$

• $\lambda(z_1,\ldots,z_k)$ are the interpolation coefficients satisfying
$\lambda(z_1,\ldots,z_k) \in [0,1]$, $k = 0 \ldots n$.

Figure 2.11: Recursive Linear Interpolation
(diagram: $P_n(u \mid z_1 \ldots z_n)$ mixes $f_n(u \mid z_1 \ldots z_n)$ with the lower-order estimate, down to $P_{-1}(u) = 1/|\mathcal{U}|$)
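The recursion in Eq. (2.6)-(2.7) can be unrolled bottom-up, starting from the uniform distribution. The following Python sketch assumes a toy event list and a made-up weight function `toy_lambda` (weight 1 on the lower order for unseen contexts, 0.5 otherwise); these values and all function names are illustrative assumptions, not the estimated coefficients of the thesis.

```python
from collections import Counter
from typing import Callable, List, Sequence, Tuple

Event = Tuple[str, Tuple[str, ...]]  # (u, (z_1, ..., z_n))


def build_counts(events: Sequence[Event], n: int):
    """Collect C(u, z_1..z_k) and C(z_1..z_k) for every order k = 0..n."""
    joint: List[Counter] = [Counter() for _ in range(n + 1)]
    context: List[Counter] = [Counter() for _ in range(n + 1)]
    for u, z in events:
        for k in range(n + 1):
            joint[k][(u, z[:k])] += 1
            context[k][z[:k]] += 1
    return joint, context


def interpolated_prob(u: str, z: Tuple[str, ...], joint, context,
                      lam: Callable[[int, int], float], vocab_size: int) -> float:
    """P_n(u | z_1..z_n) per Eq. (2.6), starting from P_{-1}(u) = 1/|U|."""
    p = 1.0 / vocab_size
    for k in range(len(z) + 1):
        c = context[k][z[:k]]
        f_k = joint[k][(u, z[:k])] / c if c else 0.0
        weight = lam(k, c)  # weight given to the lower-order estimate
        p = weight * p + (1.0 - weight) * f_k
    return p


def toy_lambda(k: int, count: int) -> float:
    # Hypothetical weights: back off completely when the context is unseen.
    return 1.0 if count == 0 else 0.5


events = [("a", ("x", "y")), ("b", ("x", "y")), ("a", ("x", "z"))]
vocab = ["a", "b", "c"]
joint, context = build_counts(events, n=2)
p_a = interpolated_prob("a", ("x", "y"), joint, context, toy_lambda, len(vocab))
total = sum(interpolated_prob(u, ("x", "y"), joint, context, toy_lambda, len(vocab))
            for u in vocab)
```

Because the weight is 1 whenever a context count is zero, each level remains a proper distribution over the vocabulary.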

The $\lambda(z_1,\ldots,z_k)$ coefficients are grouped into equivalence classes ("tied")
based on the range into which the count $C(z_1,\ldots,z_k)$ falls; the count ranges for each
equivalence class (also called "buckets") are set such that a statistically sufficient
number of events $(u \mid z_1,\ldots,z_k)$ fall in that range. The approach is a standard one [18].
In order to determine the interpolation weights, we apply the deleted interpolation
technique:

• we split the training data in two sets, "development" and
"cross-validation", respectively;

• we get the relative frequency (maximum likelihood) estimates
$f_k(u \mid z_1,\ldots,z_k)$, $k = 0 \ldots n$ from "development" data;

• we employ the expectation-maximization (EM) algorithm [14] for determining
the maximum likelihood estimate from "cross-validation" data of the "tied"
interpolation weights $\lambda(C(z_1,\ldots,z_k))$.

Footnote: The "cross-validation" data cannot be the same as the development data; if
this were the case, the maximum likelihood estimate for the interpolation weights would
be $\lambda(C(z_1,\ldots,z_k)) = 0$, disallowing the mixing of different order relative frequency
estimates and thus performing no smoothing at all.

We have written a general deleted interpolation tool which takes as input:

• joint counts $(z_1, z_2, \ldots, z_n, u)$ gathered from the "development" and
"cross-validation" data, respectively;

• initial interpolation values and bucket descriptors for all levels in the deleted
interpolation scheme.

The program runs a pre-specified number of EM iterations at each level in the deleted
interpolation scheme (from bottom up, $k = 0 \ldots n$) and returns a descriptor file
containing the estimated coefficients.

The descriptor file can then be used for initializing the module and thus rendering
it usable for the calculation of conditional probabilities $P(u \mid z_1, z_2, \ldots, z_n)$. A sample
descriptor file for the deleted interpolation statistics module is shown in Table 2.3.
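One EM iteration for the tied weights at a single level can be sketched as follows. This is a minimal illustration under stated assumptions, not the tool described above: each held-out event is reduced to a hypothetical triple (lower-order probability, relative frequency at this level, context count), bucket bounds are read as inclusive upper limits on the count, and all numbers are invented.

```python
import math
from typing import List, Sequence, Tuple

# (P_{k-1}(u|...), f_k(u|...), C(z_1..z_k)) for one cross-validation event
HeldOut = Tuple[float, float, int]


def bucket_of(count: int, upper_bounds: Sequence[int]) -> int:
    """Index of the first bucket whose (assumed inclusive) upper bound covers count."""
    for b, upper in enumerate(upper_bounds):
        if count <= upper:
            return b
    return len(upper_bounds) - 1


def log_likelihood(lambdas: Sequence[float], heldout: Sequence[HeldOut],
                   upper_bounds: Sequence[int]) -> float:
    total = 0.0
    for p_lower, f_high, count in heldout:
        lam = lambdas[bucket_of(count, upper_bounds)]
        total += math.log(lam * p_lower + (1.0 - lam) * f_high)
    return total


def em_step(lambdas: Sequence[float], heldout: Sequence[HeldOut],
            upper_bounds: Sequence[int]) -> List[float]:
    """Re-estimate each tied weight as the mean posterior of the lower-order branch."""
    posterior_sum = [0.0] * len(lambdas)
    events = [0] * len(lambdas)
    for p_lower, f_high, count in heldout:
        b = bucket_of(count, upper_bounds)
        lower_part = lambdas[b] * p_lower
        mix = lower_part + (1.0 - lambdas[b]) * f_high
        posterior_sum[b] += lower_part / mix
        events[b] += 1
    return [posterior_sum[b] / events[b] if events[b] else lambdas[b]
            for b in range(len(lambdas))]


upper_bounds = [0, 2, 10, 10000000]   # hypothetical bucket descriptor
lambdas = [1.0, 0.5, 0.5, 0.5]        # hypothetical initial values
heldout = [(0.1, 0.5, 3), (0.2, 0.0, 1), (0.05, 0.3, 3), (0.3, 0.1, 1), (0.1, 0.4, 100)]

history = [log_likelihood(lambdas, heldout, upper_bounds)]
for _ in range(5):
    lambdas = em_step(lambdas, heldout, upper_bounds)
    history.append(log_likelihood(lambdas, heldout, upper_bounds))
```

The held-out log-likelihood recorded in `history` should not decrease from one iteration to the next, which is the property the EM algorithm guarantees.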

The deleted interpolation method is not optimal for our problem. Our models 
would require a method able to optimally combine the predictors of different nature 
in the conditioning part of the model and this is far from being met by the fixed 
hierarchical scheme used for context mixing in deleted interpolation estimation. The 
best method would be maximum entropy [5] but due to its computational burden we 
have not used it. 

2.5 Pruning Strategy 

Since the number of parses for a given word prefix $W_k$ grows faster than
exponentially with $k$, $\Omega(2^k)$, the state space of our model is huge even for relatively short
sentences. We thus have to prune most parses without discarding the most likely
ones for a given prefix $W_k$. Our pruning strategy is a synchronous multi-stack search
algorithm.

Footnote: Thanks to Bob Carpenter, Lucent Technologies Bell Labs, for pointing out this
inaccuracy in our [9] paper.

Each stack contains hypotheses (partial parses) that have been constructed
by the same number of predictor and the same number of parser operations. The
hypotheses in each stack are ranked according to the $\ln(P(W_k,T_k))$ score, highest on
top. The amount of search is controlled by two parameters:
    ## Stats_Del_Int descriptor file
    ## $Id: del_int_descriptor.tex,v 1.3 1999/03/16 17:54:16 chelba Exp $
    Stats_Del_Int::_main_counts_file = counts.devel.HH_w[...] ;
    Stats_Del_Int::_held_counts_file = counts.check.HH_w.E0.gz ;
    Stats_Del_Int::_max_order = 4 ;
    Stats_Del_Int::_no_iterations = ;
    Stats_Del_Int::_no_iterations_at_read_in = 100 ;
    Stats_Del_Int::_predicted_vocabulary_chunk = ;
    Stats_Del_Int::_prob_Epsilon = 1e-07 ;
    Stats_Del_Int::lambdas_level.0 = 2:__1__0.019 ;
    Stats_Del_Int::buckets_level.0 = 2:__0__10000000 ;
    Stats_Del_Int::lambdas_level.1 = 13:__1__0.5__0.5__[...]__0.5
                                     __0.449__1__0.260__0.138__0.073 ;
    Stats_Del_Int::buckets_level.1 = 13:__0__1__2__4__8__16__32__64
                                     __128__256__512__1024__10000000 ;
    Stats_Del_Int::lambdas_level.2 = 13:__1__0.853__0.787__0.745__0.692
                                     __0.637__0.579__0.489__0.427__0.358
                                     __0.296__0.258__0.213 ;
    Stats_Del_Int::buckets_level.2 = 13:__0__1__2__4__8
                                     __16__32__64__128__256
                                     __512__1024__10000000 ;
    Stats_Del_Int::lambdas_level.3 = 13:__1__0.935__0.905__0.878__0.855
                                     __0.812__0.743__0.686__0.633__0.595
                                     __0.548__0.515__0.517 ;
    Stats_Del_Int::buckets_level.3 = 13:__0__1__2__4__8
                                     __16__32__64__128__256
                                     __512__1024__10000000 ;
    Stats_Del_Int::lambdas_level.4 = 13:__1__0.887__0.859__0.838__0.801
                                     __0.761__0.710__0.627__0.586__0.532
                                     __0.523__0.485__0.532 ;
    Stats_Del_Int::buckets_level.4 = 13:__0__1__2__4__8
                                     __16__32__64__128__256
                                     __512__1024__10000000 ;

Table 2.3: Sample descriptor file for the deleted interpolation module
Figure 2.12: One search extension cycle
(diagram: three stack-vectors, (k), (k') and (k+1). The stacks in (k) hold hypotheses with
k predictions and 0, ..., p, p+1, ..., P_k parser operations; the stacks in (k') and (k+1)
hold hypotheses with k+1 predictions and 0, ..., P_k, P_k+1 parser operations. Arrows
mark the word predictor and tagger moves, the parser adjoin/unary transitions, and the
null parser transitions.)

• the maximum stack depth: the maximum number of hypotheses the stack can
contain at any given time;

• the log-probability threshold: the difference between the log-probability score of
the top-most hypothesis and the bottom-most hypothesis at any given state of
the stack cannot be larger than a given threshold.

Figure 2.12 shows schematically the operations associated with the scanning of a
new word $w_{k+1}$.

Footnote: $P_k$ is the maximum number of adjoin operations for a $k$-length word prefix;
since the tree is binary we have $P_k = k - 1$.

First, all hypotheses in a given stack-vector are expanded with the following word.
Then, for each possible POS tag the following word can take, we expand the
hypotheses further. Due to the finite stack size, some are discarded. We then proceed with
the PARSER expansion cycle, which takes place in two steps:

1. first, all hypotheses in a given stack are expanded with all possible PARSER
actions except the null transition. The resulting hypotheses are sent to the
immediately lower stack of the same stack-vector (same number of
WORD-PREDICTOR operations and exactly one more PARSER move). Some are
discarded due to finite stack size.

2. after completing the previous step, all resulting hypotheses are expanded with
the null transition and sent into the next stack-vector. Pruning can still occur
due to the log-probability threshold on each stack.

The pseudo-code for parsing a given input sentence is given in Algorithms 2-4.
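The two stack-level pruning parameters can be illustrated with a short Python sketch. The hypothesis representation (a score paired with an opaque identifier) and the function name `prune_stack` are assumptions made for the example; the actual search keeps full partial parses.

```python
from typing import List, Tuple

Hypothesis = Tuple[float, str]  # (ln P(W_k, T_k), hypothetical parse identifier)


def prune_stack(stack: List[Hypothesis], max_depth: int,
                log_threshold: float) -> List[Hypothesis]:
    """Keep at most max_depth hypotheses, all within log_threshold of the best."""
    ranked = sorted(stack, key=lambda h: h[0], reverse=True)[:max_depth]
    if not ranked:
        return ranked
    best_score = ranked[0][0]
    return [h for h in ranked if best_score - h[0] <= log_threshold]


stack = [(-1.0, "a"), (-2.5, "b"), (-9.0, "c"), (-1.5, "d")]
kept_loose = prune_stack(stack, max_depth=3, log_threshold=2.0)
kept_tight = prune_stack(stack, max_depth=3, log_threshold=1.0)
```

The depth limit removes the lowest-ranked hypothesis first; the threshold then removes anything too far below the top-most score.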

Second Pruning Step

The pruning strategy described so far proved to be insufficient, so in order to
approximately linearize the search effort with respect to sentence length, we chose to
also discard the hypotheses whose score is more than a fixed log-probability relative
threshold below the score of the topmost hypothesis in the current stack-vector.
This additional pruning step is performed after all hypotheses in stage $k'$ have been
extended with the null parser transition.

Footnote: Assuming that all stacks contain the maximum number of entries (equal to the
stack depth), the search effort grows with the square of the sentence length.

Cached TAGGER and PARSER Lists

Another opportunity for speeding up the search is to have a cached list of possible
POStags/parser-operations in a given TAGGER/PARSER context. A good caching
scheme should use an equivalence classification of the context that is specific enough
to actually reduce the list of possible options and general enough to apply in almost
all the situations. For the TAGGER model we cache the list of POStags for a given
word seen in the training data and scan only those in the TAGGER extension cycle
(see Algorithm 3). For the PARSER model we cache the list of parser operations seen
in a given $(h_0.\text{tag}, h_{-1}.\text{tag})$ context in the training data; parses that expose heads
whose pair of NTtags has not been seen in the training data are discarded (see
Algorithm 4).
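The TAGGER cache amounts to a dictionary from words to the tags observed with them. The sketch below is illustrative only; in particular, falling back to the full tag set for a word never seen in training is an assumption of this example, since the text does not say how such words are handled.

```python
from typing import Dict, Iterable, List, Set, Tuple


def build_tag_cache(tagged_corpus: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Map each training word to the set of POStags it was seen with."""
    cache: Dict[str, Set[str]] = {}
    for word, tag in tagged_corpus:
        cache.setdefault(word, set()).add(tag)
    return cache


def candidate_tags(cache: Dict[str, Set[str]], word: str,
                   all_tags: Set[str]) -> List[str]:
    """Tags to scan in the TAGGER extension cycle for this word."""
    # Assumed fallback: an unseen word may take any tag.
    return sorted(cache.get(word, all_tags))


cache = build_tag_cache([("the", "DT"), ("run", "VB"), ("run", "NN")])
tags_for_run = candidate_tags(cache, "run", {"DT", "NN", "VB"})
```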

2.6 Word Level Perplexity 

Attempting to calculate the conditional perplexity by assigning to a whole sentence
the probability:

$$P(W \mid T^*) = \prod_{k=0}^{n} P(w_{k+1} \mid W_k T_k^*), \qquad (2.8)$$

where $T^* = \arg\max_T P(W,T)$ (the search for $T^*$ being carried out according to
our pruning strategy) is not valid because it is not causal: when predicting $w_{k+1}$
we would be using $T^*$, which was determined by looking at the entire sentence. To be
able to compare the perplexity of our model with that resulting from the standard
trigram approach, we would need to factor in the entropy of guessing the prefix of
the final best parse $T_k^*$ before predicting $w_{k+1}$, based solely on the word prefix $W_k$.

To maintain a left-to-right operation of the language model, the probability
assignment for the word at position $k+1$ in the input sentence was made using:

$$P(w_{k+1} \mid W_k) = \sum_{T_k \in S_k} P(w_{k+1} \mid W_kT_k) \cdot \rho(W_k,T_k), \qquad (2.9)$$
$$\rho(W_k,T_k) = P(W_kT_k) \Big/ \sum_{T_k \in S_k} P(W_kT_k)$$

where $S_k$ is the set of all parses present in our stacks at the current stage $k$. This
leads to the following formula for evaluating the perplexity:

$$PPL(SLM) = \exp\left(-\frac{1}{N} \sum_{i=1}^{N} \ln\left[P(w_i \mid W_{i-1})\right]\right) \qquad (2.10)$$

Note that if we set $\rho(W_k,T_k) = \delta(T_k, T_k^* \mid W_k)$ (a 0-entropy guess for the prefix of
the parse $T_k$ to equal that of the final best parse) the two probability assignments
(2.8) and (2.9) would be the same, yielding a lower bound on the perplexity achievable
by our model when using a given pruning strategy.
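Equations (2.9) and (2.10) translate directly into code once the surviving parses are summarized by their scores. In this Python sketch the inputs are invented numbers and the function names are assumptions: `log_joint` holds $\ln P(W_kT_k)$ for each parse in $S_k$, and `word_probs` holds the matching $P(w_{k+1} \mid W_kT_k)$.

```python
import math
from typing import List, Sequence


def normalized_weights(log_joint: Sequence[float]) -> List[float]:
    """rho(W_k, T_k): P(W_k T_k) normalized over the parses in the stacks."""
    shift = max(log_joint)  # subtract the maximum for numerical stability
    unnormalized = [math.exp(score - shift) for score in log_joint]
    norm = sum(unnormalized)
    return [value / norm for value in unnormalized]


def next_word_prob(log_joint: Sequence[float], word_probs: Sequence[float]) -> float:
    """P(w_{k+1} | W_k) as the rho-weighted sum of Eq. (2.9)."""
    rho = normalized_weights(log_joint)
    return sum(r * p for r, p in zip(rho, word_probs))


def perplexity(word_level_probs: Sequence[float]) -> float:
    """PPL of Eq. (2.10) from the per-word probabilities P(w_i | W_{i-1})."""
    n = len(word_level_probs)
    return math.exp(-sum(math.log(p) for p in word_level_probs) / n)


log_joint = [math.log(0.03), math.log(0.01)]  # two surviving parses
word_probs = [0.2, 0.4]                       # their next-word predictions
p_next = next_word_prob(log_joint, word_probs)
ppl = perplexity([p_next, p_next])
```

Since the weights sum to one over whatever parses survive pruning, the word-level assignment stays a proper probability.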
Another possibility for evaluating the word level perplexity of our model is to
approximate the probability of a whole sentence:

$$P(W) = \sum_{k=1}^{N} P(W, T^{(k)}) \qquad (2.11)$$

where $T^{(k)}$ is one of the "N-best" (in the sense defined by our search) parses for
$W$. This is a deficient probability assignment, but it is useful for justifying the model
parameter re-estimation to be presented in Chapter 3.

The two estimates (2.9) and (2.11) are both consistent in the sense that if the
sums are carried over all possible parses we get the correct value for the word level
perplexity of our model.

Another important observation is that the next-word predictor probability
$P(w_{k+1} \mid W_kT_k)$ in (2.9) need not be the same as the WORD-PREDICTOR
probability (2.3) used to extract the structure $T_k$, thus leaving open the possibility of
estimating it separately. To be more specific, we can in principle have a
WORD-PREDICTOR model component that operates within the parser model, whose role
is strictly to extract syntactic structure, and a second model that is used only for the
left-to-right probability assignment:

$$P_2(w_{k+1} \mid W_k) = \sum_{T_k \in S_k} P_{WP}(w_{k+1} \mid W_kT_k) \cdot \rho(W_k,T_k), \qquad (2.12)$$
$$\rho(W_k,T_k) = P(W_kT_k) \Big/ \sum_{T_k \in S_k} P(W_kT_k) \qquad (2.13)$$

In this case the interpolation coefficient given by (2.13) uses the regular
WORD-PREDICTOR model whereas the prediction of the next word for the purpose of
word level probability assignment is made using a separate model $P_{WP}(w_{k+1} \mid W_kT_k)$.
    Transition t; // a PARSER transition
    predict (<s>, SB);
    do{
      //WORD-PREDICTOR and TAGGER
      predict (next_word, POStag);
      //PARSER
      do{
        if(h_{-1}.word != <s>){
          if(h_0.word == </s>)
            t = (adjoin-right, TOP');
          else{
            if(h_0.tag == NTlabel)
              t = [(adjoin-{left,right}, NTlabel), null];
            else
              t = [(unary, NTlabel), (adjoin-{left,right}, NTlabel), null];
          }
        }
        else{
          if(h_0.tag == NTlabel)
            t = null;
          else
            t = [(unary, NTlabel), null];
        }
      }while(t != null) //done PARSER
    }while(!(h_0.word==</s> && h_{-1}.word==<s>))
    t = (adjoin-right, TOP); //adjoin <s>_SB; DONE;

Algorithm 1: Language Model Operation
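The branching inside the PARSER loop of Algorithm 1 can be read as a function that lists the transitions available for a given pair of exposed heads. The Python sketch below is a transcription of that branching under a few assumptions: the boolean `h0_is_nonterminal` replaces the test `h_0.tag == NTlabel`, the null transition is encoded as `("null", None)`, and the label inventory is a toy one.

```python
from typing import List, Optional, Sequence, Tuple

Transition = Tuple[str, Optional[str]]
NULL: Transition = ("null", None)


def allowed_transitions(h0_word: str, h0_is_nonterminal: bool,
                        h_minus1_word: str, labels: Sequence[str]) -> List[Transition]:
    """Candidate PARSER transitions, following the if/else tree of Algorithm 1."""
    adjoins: List[Transition] = [(f"adjoin-{side}", nt)
                                 for side in ("left", "right") for nt in labels]
    unaries: List[Transition] = [("unary", nt) for nt in labels]
    if h_minus1_word != "<s>":
        if h0_word == "</s>":
            return [("adjoin-right", "TOP'")]
        if h0_is_nonterminal:
            return adjoins + [NULL]
        return unaries + adjoins + [NULL]
    if h0_is_nonterminal:
        return [NULL]
    return unaries + [NULL]


labels = ["NP", "VP"]  # hypothetical non-terminal labels
end_of_sentence = allowed_transitions("</s>", False, "the", labels)
leaf_mid_sentence = allowed_transitions("contract", False, "the", labels)
```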
    current_stack_vector // set of stacks at current input position
    future_stack_vector  // set of stacks at future input position
    hypothesis           // initial hypothesis
    stack                // initial empty stack

    // initialize algorithm
    insert hypothesis in stack;
    push stack at end of current_stack_vector;

    // traverse input sentence
    for each position in input sentence{
      PREDICTOR and TAGGER extension cycle;
      current_stack_vector = future_stack_vector;
      erase future_stack_vector;

      PARSER extension cycle;
      current_stack_vector = future_stack_vector;
      erase future_stack_vector;
    }

    // output the hypothesis with the highest score;
    output max scoring hypothesis in current_stack_vector;

Algorithm 2: Pruning Algorithm

    current_stack_vector // set of stacks at current input position
    future_stack_vector  // set of stacks at future input position
    word                 // word at current input position

    for each stack in current_stack_vector{
      // based on number of predictor and parser operations
      identify corresponding future_stack in future_stack_vector;
      for each hypothesis in stack{
        for all possible POStag assignments for word{ //CACHING
          expand hypothesis with word, POStag;
          insert hypothesis in future_stack;
        }
      }
    }

Algorithm 3: PREDICTOR and TAGGER Extension Algorithm
    current_stack_vector // set of stacks at current input position
    future_stack_vector  // set of stacks at future input position

    // all possible parser transitions but the null-transition
    for each stack in current_stack_vector, from bottom up{
      // based on number of parser operations
      identify corresponding future_stack in current_stack_vector;
      for each hypothesis in current_stack{ // HARD PRUNING
        for each parser_transition except the null-transition{ //CACHING
          expand hypothesis with parser_transition;
          insert hypothesis in future_stack;
        }
      }
    }

    // null-transition moves us to the next position in the input
    for each stack in current_stack_vector{
      // based on number of predictor and parser operations
      identify corresponding future_stack in future_stack_vector;
      for each hypothesis in current_stack{
        expand hypothesis with null-transition;
        insert hypothesis in future_stack;
      }
    }

    prune future_stack_vector //SECOND PRUNING STEP

Algorithm 4: Parser Extension Algorithm
Chapter 3

Structured Language Model
Parameter Estimation



As outlined in Section 2.6, the word level probability assigned to a training/test
set by our model is calculated using the proper word-level probability assignment
in equation (2.9). An alternative which leads to a deficient probability model is to
sum over all the complete parses that survived the pruning strategy, formalized in
equation (2.11). Let the likelihood assigned to a corpus $\mathcal{C}$ by our model $P_\theta$ be denoted
by:

• $\mathcal{L}^{L2R}(\mathcal{C}, P_\theta)$, where $P_\theta$ is calculated using (2.9), repeated here for clarity:

$$P(w_{k+1} \mid W_k) = \sum_{T_k \in S_k} P(w_{k+1} \mid W_kT_k) \cdot \rho(W_k,T_k),$$
$$\rho(W_k,T_k) = P(W_kT_k) \Big/ \sum_{T_k \in S_k} P(W_kT_k)$$

Note that this is a proper probability model.

• $\mathcal{L}^{N}(\mathcal{C}, P_\theta)$, where $P_\theta$ is calculated using (2.11):

$$P(W) = \sum_{k=1}^{N} P(W, T^{(k)})$$

This is a deficient probability model: because we are not summing
over all possible parses for a given word sequence $W$ (we discard most of them
through our pruning strategy), we underestimate the probability $P(W)$ and
thus $\sum_W P(W) < 1$.

One seeks to devise an algorithm that finds the model parameter values which 
maximize the likelihood of a test corpus. This is an unsolved problem; the standard 
approach is to resort to maximum likelihood estimation techniques on a training 
corpus and make provisions that will ensure that the increase in likelihood on training 
data carries over to unseen test data. 

In our case we would like to estimate the model component probabilities
(2.3)-(2.5). The smoothing scheme outlined in Section 2.4 is intended to prevent overtraining
and tries to ensure that maximum likelihood estimates on the training corpus will
carry over to test data. Since our problem is one of maximum likelihood estimation
from incomplete data (the parse structure along with POS/NT tags and headword
annotation for a given observed sentence is hidden), our approach will make heavy
use of the EM algorithm variant presented in Section 3.1.

The estimation procedure proceeds in two stages: first the "N-best training"
algorithm (see Section 3.2) is employed to increase the training data "likelihood"
$\mathcal{L}^{N}(\mathcal{C}, P_\theta)$; we rely on the consistency property outlined at the end of Section 2.6 to
correlate the increase in $\mathcal{L}^{N}(\mathcal{C}, P_\theta)$ with the desired increase of $\mathcal{L}^{L2R}(\mathcal{C}, P_\theta)$. The initial
parameters for this first estimation stage are gathered from a treebank as described
in Section 3.2.1.

The second stage estimates the model parameters such that $\mathcal{L}^{L2R}(\mathcal{C}, P_\theta)$ is
increased. The basic idea is to realize that the WORD-PREDICTOR in the structured
language model (as described in Chapter 2) and that used for word prediction in the
$\mathcal{L}^{L2R}(\mathcal{C}, P_\theta)$ calculation can be estimated as two separate components: one that is
used for structure generation and a second one which is used strictly for predicting
the next word as described in equation (2.9). The initial parameters for the second
component are obtained by copying the WORD-PREDICTOR estimated at stage
one.

As a final step in refining the model we have linearly interpolated the structured
language model (2.9) with a trigram model. Results and comments on them are
presented in the last section of the chapter.



3.1 Maximum Likelihood Estimation from Incomplete Data

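In many practical cases we face the following situation: we are
given a collection of data points $\mathcal{T} = \{y_1, \ldots, y_n\}$, $y_i \in \mathcal{Y}$ (the training data), which
we model as independent samples drawn from the $Y$ marginal of the parametric
distribution:

$$q_\theta(x,y),\ \theta \in \Theta,\ x \in \mathcal{X},\ y \in \mathcal{Y}$$

where $X$ is referred to as the hidden variable and $\mathcal{X}$ as the hidden event space,
respectively. The set

$$Q(\Theta) = \{q_\theta(X,Y) : \theta \in \Theta\}$$

is referred to as the model set. Let $f_\mathcal{T}(Y)$ be the relative frequency probability
distribution induced on $\mathcal{Y}$ by the collection $\mathcal{T}$.

We wish to find the maximum-likelihood estimate of $\theta$:

$$\mathcal{L}(\mathcal{T}; q_\theta) = \sum_{y \in \mathcal{Y}} f_\mathcal{T}(y) \log\Big(\sum_{x \in \mathcal{X}} q_\theta(x,y)\Big) \qquad (3.1)$$

$$\theta^* = \arg\max_{\theta \in \Theta} \mathcal{L}(\mathcal{T}; q_\theta) \qquad (3.2)$$

Starting with an initial parameter value $\theta_i$, it can be shown that a sufficient condition
for increasing the likelihood of the training data $\mathcal{T}$ (see Eq. 3.1) is to find a new
parameter value $\theta_{i+1}$ that maximizes the so-called EM auxiliary function defined as:

$$EM_{\mathcal{T},\theta_i}(\theta) = \sum_{y \in \mathcal{Y}} f_\mathcal{T}(y)\, E_{q_{\theta_i}(X \mid Y)}\left[\log(q_\theta(X,Y)) \mid y\right],\ \theta \in \Theta \qquad (3.3)$$

The EM theorem proves that choosing:

$$\theta_{i+1} = \arg\max_{\theta \in \Theta} EM_{\mathcal{T},\theta_i}(\theta) \qquad (3.4)$$

ensures that the likelihood of the training data under the new parameter value is not
lower than that under the old one, formally:

$$\mathcal{L}(\mathcal{T}; q_{\theta_{i+1}}) \geq \mathcal{L}(\mathcal{T}; q_{\theta_i}) \qquad (3.5)$$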
Under more restrictive conditions on the model family $Q(\Theta)$ it can be shown that
the fixed points of the EM procedure, $\theta_i = \theta_{i+1}$, are in fact local maxima of the
likelihood function $\mathcal{L}(\mathcal{T}; q_\theta), \theta \in \Theta$. The study of convergence properties under
different assumptions on the model class as well as different flavors of the EM algorithm
is an open area of research.

The fact that the algorithm is naturally formulated to operate with probability
distributions (although this constraint can be relaxed) makes it attractive from a
computational point of view: an alternative to maximizing the training data likelihood
would be to apply gradient maximization techniques; this may be particularly difficult
if not impossible when the analytic description of the likelihood as a function of the
parameter $\theta$ is complicated.

To further the understanding of the computational aspects of using the EM
algorithm we notice that the EM update (3.4) involves two steps:

• E-step: for each sample $y$ in the training data $\mathcal{T}$, accumulate the expectation
of $\log(q_\theta(X,Y)) \mid y$ under the distribution $q_{\theta_i}(x \mid y)$; no matter what the actual
analytic form of $\log(q_\theta(X,Y))$ is, this requires traversing all possible derivations
$(x,y)$ of the seen event $y$ that have non-zero conditional probability
$q_{\theta_i}(X = x \mid Y = y) > 0$.

• M-step: find the maximizer of the auxiliary function (3.3).

Typically the M-step is simple and the computational bottleneck is the E-step.
The latter becomes intractable with large training data set size and rich hidden event
space, as usually required by practical problems.
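The two steps can be made concrete on a toy model that is small enough to enumerate. The sketch below assumes a hypothetical family $q_\theta(x,y) = \pi_x \cdot e_x(y)$ with a binary hidden variable and three observable symbols; the model, its starting values and the relative frequencies are all invented for illustration and have nothing to do with the structured language model itself.

```python
import math
from typing import List, Sequence, Tuple


def log_likelihood(f: Sequence[float], pi: Sequence[float],
                   e: Sequence[Sequence[float]]) -> float:
    """L(T; q_theta) of Eq. (3.1) for q_theta(x, y) = pi[x] * e[x][y]."""
    return sum(fy * math.log(sum(pi[x] * e[x][y] for x in range(len(pi))))
               for y, fy in enumerate(f) if fy > 0)


def em_update(f: Sequence[float], pi: Sequence[float],
              e: Sequence[Sequence[float]]) -> Tuple[List[float], List[List[float]]]:
    nx, ny = len(pi), len(f)
    # E-step: expected joint mass gamma[x][y] = f(y) * q_theta_i(x | y)
    gamma = [[0.0] * ny for _ in range(nx)]
    for y, fy in enumerate(f):
        if fy == 0:
            continue
        joint = [pi[x] * e[x][y] for x in range(nx)]
        total = sum(joint)
        for x in range(nx):
            gamma[x][y] = fy * joint[x] / total
    # M-step: closed-form maximizer of the auxiliary function for this family
    new_pi = [sum(gamma[x]) for x in range(nx)]
    new_e = [[gamma[x][y] / new_pi[x] for y in range(ny)] for x in range(nx)]
    return new_pi, new_e


f = [0.5, 0.3, 0.2]                      # relative frequencies f_T(y)
pi = [0.6, 0.4]                          # hypothetical initial parameters
e = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]

history = [log_likelihood(f, pi, e)]
for _ in range(3):
    pi, e = em_update(f, pi, e)
    history.append(log_likelihood(f, pi, e))
marginal = [sum(pi[x] * e[x][y] for x in range(2)) for y in range(3)]
```

The E-step here enumerates every hidden value for every observed symbol, which is exactly what becomes infeasible when the hidden event space is the set of all parses. For this very simple family the $Y$ marginal matches $f_\mathcal{T}$ after the first update, so the likelihood stops changing from then on.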

In order to overcome this limitation, the model space $Q(\Theta)$ is usually structured
such that dynamic programming techniques can be used for carrying out the
E-step; see for example the hidden Markov model (HMM) parameter reestimation
procedure [3]. However, this advantage does not come for free: in order to be able to
structure the model space we need to make independence assumptions that weaken
the modeling power of our parameterization. Fortunately we are not in a hopeless
situation: a simple modification of the EM algorithm allows the traversal of only a
subset of all possible $(x,y), x \in \mathcal{X} \mid y$ for each training sample $y$ (the procedure is
dubbed "N-best training"), thus rendering it applicable to a much broader and more
powerful class of models.

3.1.1 N-best Training Procedure 

Before proceeding with the presentation of the N-best training procedure, we 
would like to introduce a view of the EM algorithm based on information geometry. 
Having gained this insight we can then easily justify the N-best training procedure. 
This is an interesting area of research to which we were introduced by the presentation 
in [6]. 

Information Geometry and EM 

The problem of maximum likelihood estimation from incomplete data can be 
viewed in an interesting geometric framework. Before proceeding, let us introduce 
some concepts and the associated notation. 

Alternating Minimization Consider the problem of finding the minimum
Euclidean distance between two convex sets $A$ and $B$:

$$d^* = \min_{a \in A,\, b \in B} d(a,b) \qquad (3.6)$$

Figure 3.1: Alternating minimization between convex sets

Start with a random point $a_1 \in A$; find the point $b_1 \in B$ closest to $a_1$; then fix $b_1$ and
find the point $a_2 \in A$ closest to $b_1$, and so on. It is intuitively clear that the distance
between the two points considered at each iteration cannot increase and that the fixed
point of the above procedure (the choice for the $(a,b)$ points does not change from
one iteration to the next) is the minimum distance $d^*$ between the sets $A$ and $B$.
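The procedure is easy to try out on two sets for which the closest point has a closed form. The Python sketch below uses two discs in the plane, an arbitrary choice made for this illustration; the centers, radii and starting point are invented, and the expected limit (center distance minus both radii) holds only for this particular setup.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def project_onto_disc(p: Point, center: Point, radius: float) -> Point:
    """Closest point to p inside the disc (p itself if already inside)."""
    dx, dy = p[0] - center[0], p[1] - center[1]
    norm = math.hypot(dx, dy)
    if norm <= radius:
        return p
    scale = radius / norm
    return (center[0] + scale * dx, center[1] + scale * dy)


def alternating_minimization(a: Point, center_a: Point, radius_a: float,
                             center_b: Point, radius_b: float,
                             iterations: int) -> Tuple[Point, Point, List[float]]:
    """Alternate closest-point steps between discs A and B, recording d(a_i, b_i)."""
    distances: List[float] = []
    b = a
    for _ in range(iterations):
        b = project_onto_disc(a, center_b, radius_b)  # fix a, move b
        distances.append(math.dist(a, b))
        a = project_onto_disc(b, center_a, radius_a)  # fix b, move a
    return a, b, distances


a_final, b_final, distances = alternating_minimization(
    a=(0.0, 1.0), center_a=(0.0, 0.0), radius_a=1.0,
    center_b=(4.0, 0.0), radius_b=1.0, iterations=20)
```

The recorded distances never increase and approach 2, the gap between the two discs.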

Formalizing this intuition proves to be less simple for a more general setup — the 
specification of sets A and B and the distance used. Csiszar and Tusnady have de- 
rived sufficient conditions under which the above alternating minimization procedure 
converges to the minimum distance between the two sets [13]. As outlined in [12], 
this algorithm is applicable to problems in information theory — channel capacity 
and rate distortion calculation — as well as in statistics — the EM algorithm. 

EM as alternating minimization Let $Q(\Theta)$ be the family of probability
distributions from which we want to choose the one maximizing the likelihood of the training
data (3.1). Let us also define a family of desired distributions on $\mathcal{X} \times \mathcal{Y}$ whose $Y$
marginal induced by the training data is the same as the relative frequency estimate
$f_\mathcal{T}(Y)$:

$$\mathcal{P}_\mathcal{T} = \{p(X,Y) : p(Y) = f_\mathcal{T}(Y)\}$$

For any pair $(p,q) \in \mathcal{P}_\mathcal{T} \times Q(\Theta)$, the Kullback-Leibler distance (KL-distance)
between $p$ and $q$ is defined as:

$$D(p \,\|\, q) = \sum_{x \in \mathcal{X},\, y \in \mathcal{Y}} p(x,y) \log\frac{p(x,y)}{q(x,y)} \qquad (3.7)$$

As shown in [13], under certain conditions on the families $\mathcal{P}_\mathcal{T}$ and $Q(\Theta)$ and using
the KL-distance, the alternating minimization procedure described in the previous
section converges to the minimum distance between the two sets:

$$D(p^* \,\|\, q^*) = \min_{p \in \mathcal{P}_\mathcal{T},\, q \in Q(\Theta)} D(p \,\|\, q) \qquad (3.8)$$

It can be easily shown (see Appendix A) that the model distribution $q^*$ that
satisfies (3.8) is also the one maximizing the likelihood of the training data,

$$q^* = \arg\max_{q_\theta \in Q(\Theta)} \mathcal{L}(\mathcal{T}; q_\theta)$$

Moreover, the alternating minimization procedure leads exactly to the EM update
equations (3.3) and (3.4), as shown in [13] and sketched in Appendix B.
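For finite event spaces the KL-distance of Eq. (3.7) is a single sum. The Python sketch below represents joint distributions as dictionaries keyed by $(x, y)$ pairs, which is a representation chosen for this example only; the two distributions are made up.

```python
import math
from typing import Dict, Tuple

Joint = Dict[Tuple[str, str], float]  # probability of each (x, y) pair


def kl_divergence(p: Joint, q: Joint) -> float:
    """D(p || q) of Eq. (3.7); infinite if q gives zero mass where p does not."""
    total = 0.0
    for event, p_xy in p.items():
        if p_xy == 0.0:
            continue  # terms with p(x, y) = 0 contribute nothing
        q_xy = q.get(event, 0.0)
        if q_xy == 0.0:
            return math.inf
        total += p_xy * math.log(p_xy / q_xy)
    return total


p = {("x0", "y0"): 0.5, ("x1", "y0"): 0.5}
q = {("x0", "y0"): 0.25, ("x1", "y0"): 0.75}
d_pq = kl_divergence(p, q)
d_pp = kl_divergence(p, p)
```

The value is zero only when the two distributions coincide, and it is not symmetric in its arguments, which is why the direction of each projection step matters.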

The Pr and Q(6) famihes one encounters in practical situations may not satisfy 
the conditions specified in [13]. However, one can easily note that decrease in D{p || q) 

at each step and correct 1-projection from q G Q{Q) to Pq finding p G Pr such 

that we minimize D{p\\ q) — are sufficient conditions for ensuring that the likelihood 
of the training data does not decrease with each iteration. Since in practice we are 
bound by computational limitations and we typically run just a few iterations, the 
guaranteed non-decrease in training data likelihood is sufficient. 
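
The non-decrease property can be observed on a toy problem. The sketch below is an
assumption-laden illustration, not the model of this thesis: it runs a few EM iterations
for a mixture of two biased coins, where the hidden event is the identity of the coin and
the observed event is the number of heads in ten flips. The data, the initial parameters
and the function names are hypothetical.

```python
import math


def component_likelihoods(heads, n, pi, p):
    """Joint q(x, y) for each hidden coin x, given y = `heads` out of n flips."""
    return [pi[x] * p[x] ** heads * (1.0 - p[x]) ** (n - heads)
            for x in range(len(pi))]


def log_likelihood(data, n, pi, p):
    """Training data log-likelihood, up to the constant binomial coefficients."""
    return sum(math.log(sum(component_likelihoods(h, n, pi, p))) for h in data)


def em_step(data, n, pi, p):
    # E-step: posterior over the hidden coin for every observation.
    posteriors = []
    for h in data:
        joint = component_likelihoods(h, n, pi, p)
        total = sum(joint)
        posteriors.append([j / total for j in joint])
    # M-step: reestimate the parameters from the expected counts.
    mass = [sum(post[x] for post in posteriors) for x in range(len(pi))]
    new_pi = [m / len(data) for m in mass]
    new_p = [sum(post[x] * h for post, h in zip(posteriors, data)) / (n * mass[x])
             for x in range(len(pi))]
    return new_pi, new_p


data = [9, 8, 2, 1, 7, 3, 8, 2]      # heads out of n = 10 flips (toy data)
pi, p = [0.5, 0.5], [0.6, 0.4]       # hypothetical initial parameters
history = [log_likelihood(data, 10, pi, p)]
for _ in range(5):                   # "just a few iterations"
    pi, p = em_step(data, 10, pi, p)
    history.append(log_likelihood(data, 10, pi, p))
```

Each entry of `history` should be at least as large as the previous one, which is the
only guarantee the argument above requires.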

3.1.2 N-best Training 

In the "N-best" training paradigm we use only a subset of the conditional hidden
event space $\mathcal{X}|y$, for any given seen $y$. Associated with the model space
$Q(\Theta)$ we now have a family of strategies to sample from $\mathcal{X}|y$ a set of
"N-best" hidden events $x$, for any $y \in \mathcal{Y}$. The family is parameterized by
$\theta \in \Theta$:

$$S_\Theta = \{s_\theta : \mathcal{Y} \to 2^{\mathcal{X}},\ \forall \theta \in \Theta\} \tag{3.9}$$

With the following definitions:

$$q^s_\theta(X,Y) \doteq q_\theta(X,Y) \cdot 1_{s_\theta(Y)}(X) \tag{3.10}$$

$$Q(S,\Theta) = \{q^s_\theta(X,Y) : \theta \in \Theta\} \tag{3.12}$$

the alternating minimization procedure between $P_T$ and $Q(S,\Theta)$ using the
KL-distance will find a sequence of parameter values $\theta_1, \ldots, \theta_n$ for which
the "likelihood":

$$\mathcal{L}^s(T; q^s_\theta) = \sum_{y} f_T(y) \log\Big(\sum_{x} q^s_\theta(x,y)\Big) \tag{3.13}$$

is monotonically increasing:
$\mathcal{L}^s(T; q^s_{\theta_1}) \le \mathcal{L}^s(T; q^s_{\theta_2}) \le \ldots \le \mathcal{L}^s(T; q^s_{\theta_n})$.
Note that due to the truncation of $q_\theta(X,Y)$ we are dealing with a deficient
probability model.

The parameter update at each iteration is very similar to that specified by the EM
algorithm under some sufficient conditions, as specified in Proposition 1 and proved
in Appendix C:



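Proposition 1 Assuming that $\forall \theta \in \Theta$,
$\mathrm{Sup}(q_\theta(x,y)) = \mathcal{X} \times \mathcal{Y}$ ("smooth" $q_\theta(x,y)$)
holds, one alternating minimization step between $P_T$ and $Q(S,\Theta)$,
$\theta_i \to \theta_{i+1}$, is equivalent to:

$$\theta_{i+1} = \arg\max_{\theta \in \Theta} \sum_{y} f_T(y)\, E_{q^s_{\theta_i}(X|Y)}[\log(q_\theta(X,Y)) \,|\, y] \tag{3.14}$$

if $\theta_{i+1}$ satisfies:

$$s_{\theta_i}(y) \subset s_{\theta_{i+1}}(y),\ \forall y \in T \tag{3.15}$$

Only $\theta \in \Theta$ s.t. $s_{\theta_i}(y) \subset s_\theta(y), \forall y \in T$ are
candidates in the M-step.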

The fact that we are working with a deficient probability model for which the
support of the distributions $q_{\theta_i}(X|Y=y), \forall y \in T$ cannot decrease from
one iteration to the next makes the above statement less interesting: even if we didn't
substantially change the model parameters from one iteration to the next
($\theta_{i+1} \approx \theta_i$) but we chose the sampling function such that
$s_{\theta_i}(y) \subset s_{\theta_{i+1}}(y), \forall y \in T$, the "likelihood"
$\mathcal{L}^s(T; q^s_\theta)$ would still be increasing due to the support expansion,
although the quality of the model has not actually increased.
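
The effect of support expansion can be seen on a small discrete example. The sketch
below evaluates the truncated "likelihood" of (3.13) for a fixed toy joint distribution
while only the size of the "N-best" set grows. The distribution `q`, the relative
frequencies `freq` and the function names are assumptions made for illustration.

```python
import math


def n_best(q_y, n):
    """Return the n hidden events x with the highest joint probability q(x, y)."""
    return sorted(q_y, key=q_y.get, reverse=True)[:n]


def truncated_log_likelihood(freq, q, n):
    """Sum over y of f(y) * log(sum of q(x, y) over the N-best x), as in (3.13)."""
    total = 0.0
    for y, f_y in freq.items():
        kept = n_best(q[y], n)
        total += f_y * math.log(sum(q[y][x] for x in kept))
    return total


# Toy joint distribution q(x, y); the parameters are never changed below.
q = {
    "y1": {"x1": 0.20, "x2": 0.10, "x3": 0.05},
    "y2": {"x1": 0.30, "x2": 0.25, "x3": 0.10},
}
freq = {"y1": 0.4, "y2": 0.6}   # relative frequencies of the observed events

ll = {n: truncated_log_likelihood(freq, q, n) for n in (1, 2, 3)}
```

The values in `ll` grow with the size of the retained set even though `q` is unchanged,
which is why a constant-size "N-best" set is the more meaningful setting.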

In practice the family of sampling functions $S_\Theta$ (3.9) is chosen such that the
support of $q_{\theta_i}(X|Y=y), \forall y \in T$ has constant size (cardinality, for
discrete hidden spaces). Typically one retains the "N-best" after ranking the hidden
sequences $x \in \mathcal{X}|y$ in decreasing order according to
$q_{\theta_i}(X|Y=y), \forall y \in T$. Proposition 1 implies that the set of "N-best"
should not change from one iteration to the next, being an invariant during model
parameter reestimation. In practice, however, we recalculate the "N-best" after each
iteration, allowing the possibility that new hidden sequences $x$ are included in the
"N-best" list at each iteration and others discarded. We do not have a formal proof that
this procedure will ensure monotonic increase of the "likelihood"
$\mathcal{L}^s(T; q^s_\theta)$.

3.2 First Stage of Model Estimation 

Let $(W,T)$ denote the joint sequence of $W$ with parse structure $T$, headword and
POS/NT tag annotation included. As described in section 2.2, $(W,T)$ was produced by
a unique sequence of model actions: word-predictor, tagger, and parser moves. The
ordered collection of these moves will be called a derivation:

$$d(W,T) = (e_1, \ldots, e_l)$$

where each elementary event

$$e_i = (u^{(m)} \,|\, z^{(m)})$$

identifies a model component action:

• $m$ denotes the model component that took the action,
$m \in \{$WORD-PREDICTOR, TAGGER, PARSER$\}$;

• $u$ is the action taken:

  - $u$ is a word for $m$ = WORD-PREDICTOR;

  - $u$ is a POS tag for $m$ = TAGGER;

  - $u \in \{$(adjoin-left, NTtag), (adjoin-right, NTtag), null$\}$
    for $m$ = PARSER;

• $z$ is the context in which the action is taken (see equations (2.3 - 2.5)):

  - $z = h_0.tag, h_0.word, h_{-1}.tag, h_{-1}.word$ for $m$ = WORD-PREDICTOR;

  - $z = w, h_0.tag, h_{-1}.tag$ for $m$ = TAGGER;

  - $z = h_{-1}.word, h_{-1}.tag, h_0.word, h_0.tag$ for $m$ = PARSER;

For each given $(W,T)$ which satisfies the requirements in section 2.2 there is a unique
derivation $d(W,T)$. The converse is not true, namely not every derivation corresponds
to a correct $(W,T)$; however, the constraints in section 2.3 ensure that these
derivations receive 0 probability.

The probability of a $(W,T)$ sequence is obtained by chaining the probabilities of
the elementary events in its derivation, as described in section 2.3:

$$P(W,T) = P(d(W,T)) = \prod_{i=1}^{length(d(W,T))} P(e_i) \tag{3.16}$$
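
The chaining in (3.16) can be sketched in a few lines of Python. The `Event` record, the
shortened contexts and the probability values below are hypothetical placeholders; in
the model itself each probability would come from the smoothed component models. The
product is accumulated in the log domain, a common precaution against underflow for long
derivations.

```python
import math
from typing import NamedTuple, Tuple


class Event(NamedTuple):
    component: str            # m: WORD-PREDICTOR, TAGGER or PARSER
    action: str               # u: the action taken
    context: Tuple[str, ...]  # z: the context of the action (shortened here)


def derivation_log_prob(derivation, event_prob):
    """Log of the product of the elementary event probabilities, as in (3.16)."""
    return sum(math.log(event_prob(e)) for e in derivation)


# Toy probability table (hypothetical values, one entry per elementary event).
toy_probs = {
    Event("WORD-PREDICTOR", "the", ("SB", "<s>")): 0.1,
    Event("TAGGER", "DT", ("the", "SB")): 0.9,
    Event("PARSER", "null", ("<s>", "SB", "the", "DT")): 0.8,
}
derivation = list(toy_probs)   # the ordered sequence e_1, ..., e_l
log_p = derivation_log_prob(derivation, toy_probs.__getitem__)
```

Exponentiating `log_p` recovers the plain product of the three toy probabilities.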
The probability of an elementary event is calculated using the smoothing technique
presented in section 2.4 and repeated here for clarity of explanation:

$$P_n(u|z_1, \ldots, z_n) = \lambda(z_1, \ldots, z_n) \cdot P_{n-1}(u|z_1, \ldots, z_{n-1}) + (1 - \lambda(z_1, \ldots, z_n)) \cdot f_n(u|z_1, \ldots, z_n), \tag{3.17}$$

$$P_{-1}(u) = \mathrm{uniform}(\mathcal{U}) \tag{3.18}$$

• $z_1, \ldots, z_n$ is the context of order $n$ when predicting $u$; $\mathcal{U}$ is the
vocabulary in which $u$ takes values;

• $f_k(u|z_1, \ldots, z_k)$ is the order-$k$ relative frequency estimate for the
conditional probability $P(u|z_1, \ldots, z_k)$:

$$f_k(u|z_1, \ldots, z_k) = C(u, z_1, \ldots, z_k) / C(z_1, \ldots, z_k),\ k = 0 \ldots n,$$

$$C(u, z_1, \ldots, z_k) = \sum_{z_{k+1}} \cdots \sum_{z_n} C(u, z_1, \ldots, z_k, z_{k+1}, \ldots, z_n),$$

$$C(z_1, \ldots, z_k) = \sum_{u \in \mathcal{U}} C(u, z_1, \ldots, z_k),$$

• $\lambda_k$ are the interpolation coefficients satisfying $0 < \lambda_k < 1$, $k = 0 \ldots n$.

The $\lambda(z_1, \ldots, z_k)$ coefficients are grouped into equivalence classes ("tied")
based on the range into which the count $C(z_1, \ldots, z_k)$ falls; the count ranges for
each equivalence class are set such that a statistically sufficient number of events
$(u|z_1, \ldots, z_k)$ fall in that range.
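
The recursion (3.17)-(3.18) can be sketched as follows. This is a minimal illustration
under stated assumptions, not the thesis implementation: the toy counts, the vocabulary,
the two-range tying function `tied_lambda`, and the choice to skip an order whose context
was never seen (which amounts to an interpolation weight of 1 for that order) are all
hypothetical.

```python
from collections import Counter


def marginal_counts(max_counts, n):
    """counts[k][(u, z_1..z_k)]: maximal order counts summed over z_{k+1}..z_n."""
    counts = [Counter() for _ in range(n + 1)]
    for event, c in max_counts.items():      # event = (u, z_1, ..., z_n)
        for k in range(n + 1):
            counts[k][event[:k + 1]] += c
    return counts


def interpolated_prob(u, context, counts, vocab, lam):
    """P_n(u | z_1..z_n) computed bottom-up from the uniform distribution."""
    prob = 1.0 / len(vocab)                               # P_{-1}(u), Eq. (3.18)
    for k in range(len(context) + 1):
        z = tuple(context[:k])
        context_count = sum(counts[k][(v,) + z] for v in vocab)   # C(z_1..z_k)
        if context_count == 0:
            continue   # assumption: unseen context, keep the lower order estimate
        f_k = counts[k][(u,) + z] / context_count         # relative frequency
        weight = lam(k, context_count)                    # tied by count range
        prob = weight * prob + (1.0 - weight) * f_k       # Eq. (3.17)
    return prob


# Toy maximal order (n = 2) counts over a three-symbol vocabulary.
max_counts = Counter({("b", "a", "a"): 2, ("a", "b", "a"): 1, ("c", "a", "b"): 1})
vocab = ["a", "b", "c"]
counts = marginal_counts(max_counts, n=2)
tied_lambda = lambda order, count: 0.2 if count >= 2 else 0.6   # two count ranges
dist = {u: interpolated_prob(u, ("a", "a"), counts, vocab, tied_lambda) for u in vocab}
```

Because every step is a convex combination of two distributions over the vocabulary, the
values in `dist` sum to one regardless of the interpolation weights chosen.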

The parameters of a given model component $m$ are:

• the maximal order counts $C^{(m)}(u, z_1, \ldots, z_n)$;

• the count ranges for grouping the interpolation values into equivalence classes
("tying");

• the interpolation value for each equivalence class;

Assuming that the count ranges and the corresponding interpolation values for each
order are kept fixed to their initial values (see section 3.2.1), the only parameters to
be reestimated using the EM algorithm are the maximal order counts
$C^{(m)}(u, z_1, \ldots, z_n)$ for each model component.

In order to avoid traversing the entire hidden space for a given observed word
sequence^ we use the "N-best" training approach presented in section 3.1.2, for which
the sampling strategy is the same as the pruning strategy presented in section 2.5.

The derivation of the reestimation formulas is presented in appendix D. The E- 
step is the one presented in section 3.1.2; the M-step takes into account the smoothing 
technique presented above (equation (3.17)). 

Note that due to both the smoothing involved in the M-step and the fact that the
set of sampled "N-best" hidden events (parses) is reevaluated at each iteration we
allow new maximal order events to appear in each model component while discarding
others. Not only are we estimating the counts of maximal order n-gram events in
each model component (WORD-PREDICTOR, TAGGER, PARSER) but we also
allow the distribution on types to change from one iteration to the other. This is
because the set of hidden events allowed for a given observed word sequence is not
invariant, as is the case in regular EM. For example, the count set that describes
the WORD-PREDICTOR component of the model to be used at the next iteration is
going to have a different n-gram composition than that used at the current iteration.
This change is presented in the experiments section, see Table 4.4.

3.2.1 First Stage Initial Parameters 

Each model component (WORD-PREDICTOR, TAGGER, PARSER) is initialized
from a set of hand-parsed sentences, in this case the UPenn Treebank manually
annotated sentences, after undergoing headword percolation and binarization, as
explained in section 2.1.1. This is a subset (approx. 90%) of the training data. Each
parse tree $(W,T)$ is then decomposed into its derivation $d(W,T)$. Separately for each
model component $m$, we:

• gather joint counts $C^{(m)}(u^{(m)}, z^{(m)})$ from the derivations that make up the
"development data" using $\rho(W,T) = 1$ (see appendix D);

• estimate the interpolation coefficients on joint counts gathered from "check
data" (the remaining 10% of the training data) using the EM algorithm [14].

These are the initial parameters used with the reestimation procedure described
in the previous section.

^normally required in the E-step

3.3 Second Stage Parameter Reestimation 

In order to improve performance, we develop a model to be used strictly for word
prediction in (2.9), different from the WORD-PREDICTOR model (2.3). We will
call this new component the L2R-WORD-PREDICTOR.

The key step is to recognize in (2.9) a hidden Markov model (HMM) with fixed
transition probabilities, although dependent on the position in the input sentence
$k$, specified by the $\rho(W_k, T_k)$ values.

The E-step of the EM algorithm [14] for gathering joint counts
$C^{(m)}(y^{(m)}, x^{(m)})$, $m$ = L2R-WORD-PREDICTOR-MODEL, is the standard one
whereas the M-step uses the same count smoothing technique as that described in
section 3.2.

The second reestimation pass is seeded with the $m$ = WORD-PREDICTOR model
joint counts $C^{(m)}(y^{(m)}, x^{(m)})$ resulting from the first parameter reestimation
pass (see section 3.2).
Chapter 4

Experiments using the Structured
Language Model

For convenience, we chose to work on the UPenn Treebank corpus [21], a subset
of the WSJ (Wall Street Journal) corpus. The vocabulary sizes were:
word vocabulary: 10k, open (all words outside the vocabulary are mapped to the
<unk> token); POS tag vocabulary: 40, closed; non-terminal tag vocabulary: 52,
closed; parser operation vocabulary: 107, closed. The training data was split into a
development set of 929,564 words (sections 00-20) and a check set of 73,760 words
(sections 21-22); the test data consisted of 82,430 words (sections 23-24). The "check"
set was used strictly for initializing the model parameters as described in section 3.2.1;
the "development" set was used with the reestimation techniques described in chapter 3.

4.1 Perplexity Results 

Table 4.1 shows the results of the reestimation techniques; E0-3 and L2R0-5 denote
iterations of the reestimation procedure described in sections 3.2 and 3.3,
respectively. A deleted interpolation trigram model derived from the same training data
had perplexity 167.14 on the same test data.

iteration      DEV set     TEST set
number         L2R-PPL     L2R-PPL
-----------    --------    --------
E0             24.70       167.47
E1             22.34       160.76
E2             21.69       158.97
E3 = L2R0      21.26       158.28
L2R5           17.44       153.76

Table 4.1: Parameter reestimation results



Simple linear interpolation between our model and the trigram model:

$$Q(w_{k+1}|W_k) = \lambda \cdot P(w_{k+1}|w_{k-1}, w_k) + (1 - \lambda) \cdot P(w_{k+1}|W_k)$$

yielded a further improvement in PPL, as shown in Table 4.2. The interpolation
weight was estimated on check data to be $\lambda = 0.36$. An overall relative reduction
of 11% over the trigram model has been achieved.



iteration    TEST set    TEST set
number       L2R-PPL     3-gram interpolated PPL
---------    --------    -----------------------
E0           167.47      152.25
E3           158.28      148.90
L2R5         153.76      147.70

Table 4.2: Interpolation with trigram results
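
The arithmetic behind the interpolation and the quoted relative reduction can be sketched
as below. The per-word probabilities are made-up placeholders; only the perplexity values
167.14 and 147.70 and the weight 0.36 come from the text and Table 4.2. The function names
are assumptions.

```python
import math


def perplexity(word_probs):
    """exp of the negative average log-probability of the predicted words."""
    return math.exp(-sum(math.log(p) for p in word_probs) / len(word_probs))


def interpolate(p_trigram, p_slm, lam=0.36):
    """Word-by-word linear interpolation; `lam` weights the trigram model."""
    return [lam * a + (1.0 - lam) * b for a, b in zip(p_trigram, p_slm)]


def relative_reduction(baseline, value):
    """Relative perplexity reduction of `value` with respect to `baseline`."""
    return (baseline - value) / baseline


# Hypothetical per-word probabilities assigned by the two models.
p_trigram = [0.01, 0.20, 0.05]
p_slm = [0.03, 0.10, 0.08]
mixed_ppl = perplexity(interpolate(p_trigram, p_slm))

# Trigram baseline (167.14) versus the interpolated L2R5 model (147.70).
reduction = relative_reduction(167.14, 147.70)
```

With the table values, `reduction` comes out between 0.11 and 0.12, in line with the 11%
quoted above.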



As outlined in section 2.6, the perplexity value calculated using (2.8):

$$P(W|T^*) = \prod_{k=0}^{n} P(w_{k+1}|W_k T^*_k),\quad T^* = \arg\max_T P(W,T)$$

is a lower bound for the achievable perplexity of our model; for the above search
parameters and E3 model statistics this bound was 99.60, corresponding to a relative
reduction of 41% over the trigram model. This suggests that a better parameterization
in the PARSER model, one that reduces the entropy $H(\rho(T_k|W_k))$ of guessing the
"good" parse given the word prefix, can lead to a better model. Indeed, as we
already pointed out, the trigram model is a particular case of our model for which the
parse is always right branching and we have no POS/NT tag information, leading to
$H(\rho(T_k|W_k)) = 0$ and a standard 3-gram WORD-PREDICTOR. The 3-gram model
is thus an extreme case of the structured language model: one for which the "hidden"
structure is a function of the word prefix. Our result shows that better models can
be obtained by allowing richer "hidden" structure (parses) and that a promising
direction of research may be to find the best compromise between the predictive
power of the WORD-PREDICTOR, measured by $H(w_{k+1}|T_k, W_k)$, and the ease
of guessing the hidden structure $T_k|W_k$, measured by $H(\rho(T_k|W_k))$, on which the
WORD-PREDICTOR operation is based. A better solution would be a maximum
entropy PARSER model which incorporates a richer set of predictors in a better
way than the deleted interpolation scheme we are using. Due to the computational
problems faced by such a model we have not pursued this path, although we consider
it a very promising one.

4.1.1 Comments and Experiments on Model Parameters Reestimation

The word level probability assigned to a training/test set by our model is
calculated using the proper word-level probability assignment in equation (2.9). An
alternative which leads to a deficient probability model is to sum over all the
complete parses that survived the pruning strategy, formalized in equation (2.11). Let
the likelihood assigned to a corpus $C$ by our model $P_\theta$ be denoted by:

• $\mathcal{L}^{L2R}(C, P_\theta)$, where $P_\theta$ is calculated using (2.9), repeated
here for clarity:

$$P(w_{k+1}|W_k) = \sum_{T_k \in S_k} P(w_{k+1}|W_k T_k) \cdot \rho(W_k, T_k),$$

$$\rho(W_k, T_k) = P(W_k T_k) / \sum_{T_k \in S_k} P(W_k T_k)$$

Note that this is a proper probability model.

• $\mathcal{L}^{N}(C, P_\theta)$, where $P_\theta$ is calculated using (2.11):

$$P(W) = \sum_{k=1}^{N} P(W, T^{(k)})$$

This is a deficient probability model.

One seeks to devise an algorithm that finds the model parameter values which 
maximize the likelihood of a test corpus. This is an unsolved problem; the standard 
approach is to resort to maximum likelihood estimation techniques on the training 
corpus and make provisions that will ensure that the increase in likelihood on training 
data carries over to unseen test data. 

As outlined previously, the estimation procedure of the SLM parameters takes
place in two stages:

1. the "N-best training" algorithm (see Section 3.2) is employed to increase the
training data "likelihood" $\mathcal{L}^{N}(C, P_\theta)$. The initial parameters for this
first estimation stage are gathered from a treebank. The perplexity is still evaluated
using the formula in Eq. (2.9).

2. estimate a separate L2R-WORD-PREDICTOR model such that
$\mathcal{L}^{L2R}(C, P_\theta)$ is increased (see Eq. (2.12)). The initial parameters for
the L2R-WORD-PREDICTOR component are obtained by copying the WORD-PREDICTOR
estimated at stage one.

As explained in Section 4.1.1, the "N-best training" algorithm is employed to
increase the training data "likelihood" $\mathcal{L}^{N}(C, P_\theta)$; we rely on the
consistency of the probability estimates underlying the calculation of the two different
likelihoods to correlate the increase in $\mathcal{L}^{N}(C, P_\theta)$ with the desired
increase of $\mathcal{L}^{L2R}(C, P_\theta)$.

To be more specific, $\mathcal{L}^{N}(C, P_\theta)$ and $\mathcal{L}^{L2R}(C, P_\theta)$
are calculated using the probability assignments in Eq. (2.11), which is deficient, and
Eq. (2.9), respectively. Both probability estimates are consistent in the sense that if we
summed over all the parses $T$ for a given word sequence $W$ they would yield the correct
probability $P(W)$ according to our model. Although there is no formal proof, there are
reasons to believe that the N-best reestimation procedure should not decrease the
$\mathcal{L}^{N}(C, P_\theta)$ likelihood (it is very similar to a rigorous EM approach),
but no claim can be made about the increase in the $\mathcal{L}^{L2R}(C, P_\theta)$
likelihood, which is the one we are interested in. Our experiments show that the increase
in $\mathcal{L}^{N}(C, P_\theta)$ is correlated with an increase in
$\mathcal{L}^{L2R}(C, P_\theta)$, a key factor in this being a good heuristic search
strategy (see Section 2.5). Table 4.3 shows the evolution of different "perplexity"
values during N-best reestimation. L2R-PPL is calculated using the proper probability
assignment in Eq. (2.9). TOP-PPL and BOT-PPL are calculated using the probability
assignment in Eq. (2.8), where $T^* = \arg\max_T P(W,T)$ and $T^* = \arg\min_T P(W,T)$,
respectively, the search for $T^*$ being carried out according to our pruning strategy;
we condition the word predictions on the topmost and bottom-most parses present
in the stacks after parsing the entire sentence. SUM-PPL is calculated using the
deficient probability assignment in Eq. (2.11). It can be noticed that TOP-PPL and
BOT-PPL stay almost constant during the reestimation process; the value of TOP-PPL
is slightly increasing and that of BOT-PPL is slightly decreasing. As expected,
the value of the SUM-PPL decreases and its decrease is correlated with that of the
L2R-PPL.

"Perplexity"    Iteration E0    Iteration E3    Relative Change
------------    ------------    ------------    ---------------
TOP-PPL         97.5            99.3            +1.85%
BOT-PPL         107.9           106.2           -1.58%
SUM-PPL         195.1           175.5           -10.05%
L2R-PPL         167.5           158.3           -5.49%

Table 4.3: Evolution of different "perplexity" values during training



It is very important to note that due to both the smoothing involved in the M-step
(imposed by the smooth parameterization of the model; unlike standard parameterizations,
we do not reestimate the relative frequencies from which each component probabilistic
model is derived, since that would lead to a shrinking or, at best, fixed set of events)
and the fact that the set of sampled "N-best" hidden events (parses) is reevaluated at
each iteration, we allow new maximal order events to appear in each model component
while discarding others. Not only are we estimating the counts of maximal order n-gram
events in each model component (WORD-PREDICTOR, TAGGER, PARSER) but we also allow
the distribution on types to change from one iteration to the other. This is because
the set of hidden events allowed for a given observed word sequence is not invariant.
For example, the count set that describes the WORD-PREDICTOR component of
the model to be used at the next iteration may have a different n-gram composition
than that used at the current iteration.

We evaluated the change in the distribution on types^ of the maximal order events
from one iteration to the next. Table 4.4 shows the dynamics of the set
of types of the different order events during the reestimation process for the
WORD-PREDICTOR model component. Similar dynamics were observed for the other two
components of the model. The equivalence classifications corresponding to each order
are:

• $z = h_0.tag, h_0.word, h_{-1}.tag, h_{-1}.word$ for order 4;

• $z = h_0.tag, h_0.word, h_{-1}.tag$ for order 3;

• $z = h_0.tag, h_0.word$ for order 2;

• $z = h_0.tag$ for order 1;

An event of order 0 consists of the predicted word only.



iteration     no. tokens    no. types for order
                            0         1          2            3            4
----------    ----------    ------    -------    ---------    ---------    ---------
E0            929,564       9,976     77,225     286,329      418,843      591,505
E1            929,564       9,976     77,115     305,266      479,107      708,135
E2            929,564       9,976     76,911     305,305      482,503      717,033
E3            929,564       9,976     76,797     307,100      490,687      731,527
L2R0 (=E3)    929,564       9,976     76,797     307,100      490,687      731,527
L2R1-5        929,564       9,976     257,137    2,075,103    3,772,058    5,577,709

Table 4.4: Dynamics of WORD-PREDICTOR distribution on types during
reestimation



^A type is a particular value, regarded as one entry in the alphabet spanned by a given
random variable.



The higher order events, closer to the root of the linear interpolation scheme in
Figure 2.11, become more and more diverse during the first estimation stage, as
opposed to the lower order events. This shows that the "N-best" parses for a given
sentence change from one iteration to the next. Although the E0 counts were
collected from "1-best" parses (binarized treebank parses), the increase in number
of maximal order types from E0 to E1 (collected from "N-best", N = 10) is far
from dramatic, yet higher than that from E1 to E2 (both collected from "N-best"
parses).

The big increase in number of types from E3 (=L2R0) to L2R1 is due to the fact
that at each position in the input sentence, WORD-PREDICTOR counts are now
collected for all the parses in the stacks, many of which do not belong to the set of
N-best parses for the complete sentence used for gathering counts during E0-3.

Although the perplexity on test data still decreases during the second reestimation
stage (we are not over-training), this decrease is very small and not worth the
computational effort if the model is linearly interpolated with a 3-gram model, as
shown in Table 4.2. Better integration of the 3-gram and the head predictors is
desirable.

4.2 Miscellaneous Other Experiments 

4.2.1 Choosing the Model Components Parameterization 

The experiments presented in [8] show the usefulness of the two most recent ex- 
posed heads for word prediction. The same criterion — conditional perplexity — 
can be used as a guide in selecting the parameterization of each model component: 
WORD-PREDICTOR, TAGGER, PARSER. For each model component we gather 
the counts from the UPenn Treebank as explained in Section 3.2.1. The relative 
frequencies are determined from the "development" data, the interpolation weights 
estimated on "check" data — as described in Section 3.2.1. We then test each model 
component on counts gathered from the "test" data. Note that the smoothing scheme 
described in Section 2.4 discards elements of the context z from right to left. 
Selecting the WORD-PREDICTOR Equivalence Classification

The experiments in [8] were repeated using deleted interpolation as a modeling tool
and the training/testing setup described above. The results for different equivalence
classifications of the word-parse k-prefix $(W_k, T_k)$ are presented in Table 4.5.

      Equivalence Classification                                  Cond. PPL    Voc. Size
--    --------------------------------------------------------    ---------    ---------
HH    $z = h_0.tag, h_0.word, h_{-1}.tag, h_{-1}.word$            115          10,000
WW    $z = w_{-1}.tag, w_{-1}.word, w_{-2}.tag, w_{-2}.word$      156          10,000
hh    $z = h_0.word, h_{-1}.word$                                 154          10,000
ww    $z = w_{-1}.word, w_{-2}.word$                              167          10,000

Table 4.5: WORD-PREDICTOR conditional perplexities

The different equivalence classifications of the word-parse k-prefix retain the
following predictors:

1. ww: the two previous words (regular 3-gram model);

2. hh: the two most recent exposed headwords (no POS/NT label information);

3. WW: the two previous exposed words along with their POS tags;

4. HH: the two most recent exposed heads (headwords along with their NT/POS
labels);

It can be seen that the most informative predictors for the next word are the exposed
heads (HH model). Except for the ww model (the regular 3-gram model), none of the
others is a valid word-level perplexity since it conditions the prediction on hidden
information (namely the tags present in the treebank parses); the entropy of guessing
the hidden information would need to be factored in.

Selecting the TAGGER Equivalence Classification

The results for different equivalence classifications of the word-parse k-prefix
$(W_k, T_k)$ for the TAGGER model are presented in Table 4.6.

       Equivalence Classification                                      Cond. PPL    Voc. Size
---    ------------------------------------------------------------    ---------    ---------
HHw    $z = w_k, h_0.tag, h_0.word, h_{-1}.tag, h_{-1}.word$           1.23         40
WWw    $z = w_k, w_{-1}.tag, w_{-1}.word, w_{-2}.tag, w_{-2}.word$     1.24         40
ttw    $z = w_k, h_0.tag, h_{-1}.tag$                                  1.24         40

Table 4.6: TAGGER conditional perplexities

The different equivalence classifications of the word-parse k-prefix retain the
following predictors:

1. WWw: the two previous exposed words along with their POS tags and the word
to be tagged;

2. HHw: the two most recent exposed heads (headwords along with their NT/POS
labels) and the word to be tagged;

3. ttw: the NT/POS labels of the two most recent exposed heads and the word to
be tagged;

It can be seen that among the equivalence classifications considered, none performs
significantly better than the others, and the prediction of the POS tag for a given
word is a relatively easy task: the conditional perplexities are very close to one.
Because of its simplicity, we chose to work with the ttw equivalence classification.

Selecting the PARSER Equivalence Classification

The results for different equivalence classifications of the word-parse k-prefix
$(W_k, T_k)$ for the PARSER model are presented in Table 4.7.

        Equivalence Classification                           Cond. PPL    Voc. Size
----    -------------------------------------------------    ---------    ---------
HH      $z = h_0.tag, h_0.word, h_{-1}.tag, h_{-1}.word$     1.68         107
hhtt    $z = h_0.tag, h_{-1}.tag, h_0.word, h_{-1}.word$     1.54         107
tt      $z = h_0.tag, h_{-1}.tag$                            1.71         107

Table 4.7: PARSER conditional perplexities

The different equivalence classifications of the word-parse k-prefix retain the
following predictors:

1. HH: the two most recent exposed heads (headwords along with their NT/POS
labels);

2. hhtt: same as HH, except that the backing-off order is changed;

3. tt: the NT/POS labels of the two most recent exposed heads;

It can be seen that the presence of headwords improves the accuracy of the PARSER
component; also, the backing-off order of the predictors is important (hhtt vs. HH).
We chose to work with the hhtt equivalence classification.

4.2.2 Fudged TAGGER and PARSER Scores 

The probability values for the three model components fall into different ranges.
As pointed out at the beginning of this chapter, the WORD-PREDICTOR vocabulary
is of the order of thousands whereas the TAGGER and PARSER have vocabulary
sizes of the order of tens. This leads to the undesirable effect that the contribution of
the TAGGER and PARSER to the overall probability of a given partial parse $P(W,T)$
is very small compared to that of the WORD-PREDICTOR. We explored the idea
of bringing the probability values into the same range by fudging the TAGGER and
PARSER probability values, namely:

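$$P(W,T) = \prod_{k=1}^{n+1} \left[ P(w_k|W_{k-1}T_{k-1}) \cdot \left\{ P(t_k|W_{k-1}T_{k-1}, w_k) \cdot P(T^k_{k-1}|W_{k-1}T_{k-1}, w_k, t_k) \right\}^{\gamma} \right] \tag{4.1}$$

$$P(T^k_{k-1}|W_{k-1}T_{k-1}, w_k, t_k) = \prod_{i=1}^{N_k} P(p^k_i|W_{k-1}T_{k-1}, w_k, t_k, p^k_1 \ldots p^k_{i-1}) \tag{4.2}$$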

where $\gamma$ is the fudge factor. For $\gamma \neq 1.0$ we do not have a valid
probability assignment anymore; however, the L2R-PPL calculated using Eq. (2.9) is still
a valid word-level probability assignment due to the re-normalization of the
interpolation coefficients. Table 4.8 shows the PPL values calculated using Eq. (2.9)
where $P(W,T)$ is calculated using Eq. (4.1). As can be seen, the optimal fudge factor
turns out to be 1.0, corresponding to the correct calculation of the probability $P(W,T)$.

fudge    0.01    0.02    0.05    0.1    0.2    0.5    1.0    2.0    5.0    10.0    20.0    50.0    100.0
PPL      341     328     296     257    210    168    167    189    241    284     337     384     408

Table 4.8: Perplexity Values: Fudged TAGGER and PARSER



4.2.3 Maximum Depth Factorization of the Model 

The word level probability assignment used by the SLM, Eq. (2.9), can be
thought of as a model factored over different maximum reach depths. Let $D(T_k)$ be
the "depth" in the word-prefix $W_k$ at which the headword $h_{-1}.word$ can be found.

Eq. (2.9) can be rewritten as:

$$P(w_{k+1}|W_k) = \sum_{d=0}^{d=k} P(d|W_k) \cdot P(w_{k+1}|W_k, d), \tag{4.3}$$

where:

$$P(d|W_k) = \sum_{T_k \in S_k} \rho(W_k, T_k) \cdot \delta(D(T_k), d)$$

$$P(w_{k+1}|W_k, d) = \sum_{T_k \in S_k} P(T_k|W_k, d) \cdot P(w_{k+1}|W_k, T_k)$$

$$P(T_k|W_k, d) = \rho(W_k, T_k) \cdot \delta(D(T_k), d) / P(d|W_k)$$

We can interpret Eq. (4.3) as a linear interpolation of models that reach back to
different depths in the word prefix $W_k$. The expected value of $D(T_k)$ shows how far
the SLM reaches in the word prefix:

$$E_{SLM}[D] = 1/N \sum_{k=0}^{N} \sum_{d=0}^{k} d \cdot P(d|W_k) \tag{4.4}$$

For the 3-gram model we have $E_{3\text{-}gram}[D] = 2$. We evaluated the expected depth
of the SLM using the formula in Eq. (4.4). The results are presented in Table 4.9.
It can be seen that the memory of the SLM is considerably higher than that of the
3-gram model, whose depth is 2.
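
The depth statistics can be sketched as follows. The stack contents are hypothetical:
each position holds a list of (weight, depth) pairs, standing in for the normalized
weights $\rho(W_k, T_k)$ and the depths $D(T_k)$ of the parses in the stacks. The sketch
averages over the positions supplied, which is an assumption about how the normalization
in Eq. (4.4) is applied.

```python
import math


def depth_distribution(parses):
    """P(d | W_k): total weight of the parses whose headword depth equals d."""
    dist = {}
    for rho, depth in parses:
        dist[depth] = dist.get(depth, 0.0) + rho
    return dist


def expected_depth(stacks):
    """Average over positions k of the sum over d of d * P(d | W_k)."""
    total = 0.0
    for parses in stacks:
        total += sum(d * p for d, p in depth_distribution(parses).items())
    return total / len(stacks)


# Toy stacks (hypothetical weights and depths), one list per word position.
stacks = [
    [(1.0, 1)],
    [(0.7, 2), (0.3, 1)],
    [(0.5, 2), (0.25, 3), (0.25, 5)],
]
avg_depth = expected_depth(stacks)
```

For these toy stacks the per-position expected depths are 1.0, 1.7 and 3.0, so
`avg_depth` is their mean.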

iteration    expected depth
number       E[D]
---------    --------------
E0           3.35
E1           3.46
E2           3.45

Table 4.9: Maximum Depth Evolution During Training

Figure 4.1 shows the distribution $P(d|W_k)$, averaged over all positions $k$ in the
test string:

$$P(d|W) = 1/N \sum_{k=1}^{N} P(d|W_k)$$

[Figure 4.1. Plot title: "Depth distribution according to P(T/W)"; x axis: depth;
legend: E[depth(E0)] = 3.35, E[depth(E1)] = 3.46.]

Figure 4.1: Structured Language Model Maximum Depth Distribution

It can be seen that the SLM makes a prediction which reaches farther than the
3-gram model in about 40% of cases, on the average. The nonzero value at depth 1 is
due to the fact that the prediction of the first word in a sentence is based on a context
of length 1 in both the SLM and the 3-gram model.
Chapter 5

A* Decoder for Lattices 



5.1 Two Pass Decoding Techniques 

In a two-pass recognizer, a computationally cheap decoding step is run in the 
first pass, a set of hypotheses is retained as an intermediate result and then a more 
sophisticated recognizer is run over these in a second pass — usually referred to as the 
rescoring pass. The search space in the second pass is much more restricted compared 
to the first pass so we can afford using better — usually also computationally more 
intensive — acoustic and/or language models. 

The two most popular two-pass strategies differ mainly in the number of interme- 
diate hypotheses saved after the first pass and the form in which they are stored. 

In the so-called "N-best rescoring" method (the value of N is typically 100-1000), a
list of complete hypotheses along with acoustic/language model scores is retained and
then rescored using more complex acoustic/language models.

Due to the limited number of hypotheses in the N-best list, the second pass
recognizer might be too constrained by the first pass, so a more comprehensive list of
hypotheses is often needed. The alternative preferred to N-best list rescoring is
"lattice rescoring". The intermediate format in which the hypotheses are stored is now a
directed acyclic graph in which the nodes are a subset of the language model states in
the composite hidden Markov model and the arcs are labeled with words. Typically,
the first pass acoustic/language model scores associated with each arc (or link)
in the lattice are saved and the nodes contain time alignment information.

For both cases one can calculate the "oracle" word error rate: the word error rate
along the hypothesis with the minimum number of errors. The oracle-WER decreases
with the number of hypotheses saved.
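
The oracle word error rate of an N-best list can be sketched as below. The sketch counts
errors with the standard word-level edit distance (substitutions, insertions, deletions);
the reference sentence, the hypotheses and the function names are made up for
illustration, and a lattice would require a search over paths rather than a simple loop.

```python
def word_errors(reference, hypothesis):
    """Levenshtein distance between two word lists."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        current = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            current.append(min(previous[j] + 1,        # deletion
                               current[j - 1] + 1,     # insertion
                               previous[j - 1] + (ref_word != hyp_word)))  # substitution
        previous = current
    return previous[-1]


def oracle_wer(reference, hypotheses):
    """Word error rate along the hypothesis with the minimum number of errors."""
    return min(word_errors(reference, h) for h in hypotheses) / len(reference)


reference = "the cat sat on the mat".split()
n_best = [h.split() for h in (
    "a cat sat on the hat",
    "the cat sat on a mat",
    "the cat sat on the mat",
)]
# Oracle WER when only the first 1, 2 or 3 hypotheses are saved.
oracle = [oracle_wer(reference, n_best[:n]) for n in (1, 2, 3)]
```

In this toy list the oracle error rate drops as more hypotheses are kept, because each
added hypothesis happens to be closer to the reference than the previous ones.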

Of course, a set of N-best hypotheses can be assembled as a lattice, the difference
between the two being just in the number of different hypotheses (with different
time-alignments) stored in the lattice. One reason which makes the N-best rescoring
framework attractive is the possibility to use "whole sentence" language models:
models that are able to assign a score only to complete sentences due to the fact that
they do not operate in a left-to-right fashion. The drawbacks are that the number of
hypotheses explored is too small and their quality is reminiscent of the models used in
the first pass. To clarify the latter assertion, assume that the second pass language
model to be applied is dramatically different from the one used in the first pass and
that, if we could afford to extract the N-best using the better language model, they
would have a different kind of errors, specific to this language model. In that case
simple rescoring of the N-best list generated using the weaker language model may
constrain the stronger language model used in the second pass too much, not allowing
it to show its merits.

It is thus desirable to have a sample of the possible word hypotheses which is as 
complete as possible — not biased towards a given model — and at the same time of 
manageable size. This is what makes lattice rescoring the chosen method in our case, 
hoping that simply by increasing the number of hypotheses retained one reduces the 
bias towards the first pass language model. 

5.2 A* Algorithm 

The A* algorithm [22] is a tree search strategy that could be compared to depth- 
first tree-traversal: pursue the most promising path as deeply as possible. 






Let a set of hypotheses

$$L = \{h : x_1, \ldots, x_n\}, \quad x_i \in \mathcal{W}^* \ \forall i$$

be organized as a prefix tree. We wish to obtain the maximum scoring hypothesis
under the scoring function $f : \mathcal{W}^* \rightarrow \Re$:

$$h^* = \arg\max_{h \in L} f(h)$$

without scoring all the hypotheses in L, if possible with a minimal computational
effort.

The algorithm operates with prefixes and suffixes of hypotheses in the set L;
we will denote prefixes (anchored at the root of the tree) with x and suffixes
(anchored at a leaf) with y. A complete hypothesis h can be regarded as the
concatenation of a prefix x and a suffix y: $h = x.y$. We assume that the function $f(\cdot)$
can be evaluated at any prefix x, i.e. $f(x)$ is a meaningful quantity.

To be able to pursue the most promising path, the algorithm needs to evaluate
all the possible suffixes that are allowed in L for a given prefix $x = w_1, \ldots, w_p$
(see Figure 5.1). Let $C_L(x)$ be the set of suffixes allowed by the tree for a prefix x and
assume we have an overestimate $g(x.y)$ for the $f(x.y)$ score of any complete hypothesis x.y:

$$g(x.y) = f(x) + h(y|x) \ge f(x.y)$$

Imposing the condition that $h(y|x) = 0$ for empty y, we have

$$g(x) = f(x), \ \forall \text{ complete } x \in L$$

that is, the overestimate becomes exact for complete hypotheses $h \in L$. Let the A*
ranking function $g_L(x)$ be defined as:

$$g_L(x) = \max_{y \in C_L(x)} g(x.y) = f(x) + h_L(x), \text{ where} \qquad (5.1)$$

$$h_L(x) = \max_{y \in C_L(x)} h(y|x) \qquad (5.2)$$

$g_L(x)$ is an overestimate for the $f(\cdot)$ score of any complete hypothesis that has the
prefix x; the overestimate becomes exact for complete hypotheses:

$$g_L(x) \ge f(x.y), \ \forall y \in C_L(x) \qquad (5.3)$$

$$g_L(h) = f(h), \ \forall \text{ complete } h \in L \qquad (5.4)$$

Figure 5.1: Prefix Tree Organization of a Set of Hypotheses L

The A* algorithm uses a potentially infinite stack^ in which prefixes x are ordered
in decreasing order of the A* ranking function $g_L(x)$^; at each extension step the
topmost prefix $x = w_1, \ldots, w_p$ is popped from the stack, expanded with all possible
one-symbol continuations of x in L, and then all the resulting expanded prefixes
(among which there may be complete hypotheses as well) are inserted back into the
stack. The stopping condition is: whenever the popped hypothesis is a complete one,
retain it as the overall best hypothesis h* (see Algorithm 5).

    // empty_hypothesis;
    // top_most_hypothesis;
    // a_hypothesis;

    insert empty_hypothesis in stack;
    do
    { // one A* extension step
        top_most_hypothesis = pop top-most hypothesis from stack;
        for all possible one-symbol continuations w of top_most_hypothesis
        {
            a_hypothesis = expand top_most_hypothesis with w;
            insert a_hypothesis in stack;
        }
    } while (top_most_hypothesis is incomplete)
    // top_most_hypothesis is the highest f(.) scoring one

Algorithm 5: A* search

The justification for the correctness of the algorithm lies in the fact that upon
completion, no other prefix x in the stack has a higher stack score than h*:

$$g_L(x) \le g_L(h^*) = f(h^*)$$

But $g_L(x) \ge f(x.y), \ \forall y \in C_L(x)$, which means that no complete hypothesis x.y could
possibly result in a higher $f(\cdot)$ score than h*; formally:

$$f(x.y) \le g_L(x) \le g_L(h^*) = f(h^*), \ \forall x \in stack$$

Since the stack is infinite, it is guaranteed to contain prefixes for all hypotheses $h \in L$
(see Algorithm 5), which means that:

$$f(x.y) \le g_L(x) \le g_L(h^*) = f(h^*), \ \forall x.y \in L$$

^The stack need not be larger than |L| = n

^In fact any overestimate satisfying both Eq. (5.3) and (5.4) will ensure correctness of the
algorithm
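A runnable sketch of the search in Algorithm 5 is given below. It is an illustration under stated assumptions rather than the decoder used in the thesis: hypotheses are word tuples, no hypothesis is a proper prefix of another (each ends in a hypothetical `</s>` token), the stack is a binary heap, and the toy heuristic is h_L(x) = 0, which is a valid overestimate here only because every per-word score is non-positive.

```python
import heapq


def a_star(hypotheses, f, h_L):
    """Return (best hypothesis, number of extension steps) for a set L of word tuples.

    f(x):   score of any prefix x
    h_L(x): overestimate of the best suffix score after x; must be 0 for complete x
    """
    complete = set(hypotheses)
    # children[x] = one-symbol extensions of prefix x that are allowed in L
    children = {}
    for hyp in complete:
        for p in range(len(hyp)):
            children.setdefault(hyp[:p], set()).add(hyp[:p + 1])

    # heapq is a min-heap, so the ranking score g_L(x) = f(x) + h_L(x) is negated
    stack = [(-(f(()) + h_L(())), ())]
    extensions = 0
    while stack:
        _, x = heapq.heappop(stack)
        if x in complete:            # stopping condition: popped hypothesis is complete
            return x, extensions
        extensions += 1              # one A* extension step
        for child in children.get(x, ()):
            heapq.heappush(stack, (-(f(child) + h_L(child)), child))
    return None, extensions


# Toy example (hypothetical scores): f is a sum of per-word log-scores.
word_score = {"a": -1.0, "b": -3.0, "c": -0.5, "d": -4.0, "</s>": 0.0}
L = [("a", "c", "</s>"), ("a", "d", "</s>"), ("b", "c", "</s>")]
f = lambda x: sum(word_score[w] for w in x)
h_L = lambda x: 0.0
best, steps = a_star(L, f, h_L)
```

In this toy run the branch starting with "b" is never expanded, which is the point of ranking prefixes by an overestimate instead of scoring every hypothesis in L.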

To get a better grasp of the workings of A* we examine two limiting cases: perfect
estimation of the scoring function $f(\cdot)$ value along the most promising suffix for any
given prefix, and no clue at all.

In the first case we have $g(x.y) = f(x) + h(y|x) = f(x.y)$; notice that the A*
ranking function becomes $g_L(x) = \max_{y \in C_L(x)} f(x.y)$, which means that
we are able to find the best continuation of the current prefix. This makes the
entire A* algorithm pointless: for x being the empty hypothesis, we just calculate
$g_L(x)$ and retain the complete "continuation" $y = h^*$ that yielded the maximal $g_L(x)$.
The A* algorithm simply builds h* by traversing y left to right; the topmost entry
in the stack will always have score $f(h^*)$, differently distributed among x and y in
x.y: $f(x) + h(y|x) = f(h^*)$. The number of A* extension steps (see Algorithm 5)
will be equal to the length of h*, making the search effort minimal. Notice that in
this particular case a stack truncated at depth 1 suffices, suggesting that there is
a correlation between the search effort and the goodness of the estimate in the A*
ranking function.

In the second case we can set $h(y|x) = \infty$ for y non-empty and, of course, $h(y|x) = 0$
for empty y. This will make $g_L(x) = f(x)$ if x is complete and $g_L(x) = \infty$ if x
is incomplete; any incomplete hypothesis will thus have a higher score than any
complete hypothesis, causing A* to evaluate all the complete hypotheses in L, hence
degenerating into an exhaustive search; the search effort is maximal.

In practice the h{y\x) function is chosen heuristically. 

5.2.1 A* for Lattice Decoding 

There are a few reasons that make A* appealing for our problem: 

• the lattice can be conceptually structured as a prefix tree of hypotheses — the 
time alignment is taken into consideration when comparing two word prefixes; 

• the algorithm operates with whole prefixes x, making it ideal for incorporating 
language models whose memory is the entire utterance prefix; 

• a reasonably good overestimate h{y\x) and an efficient way to calculate hL{x) 
are readily available using the n-gram model, as we will explain later. 

Before explaining our approach to lattice decoding using the A* algorithm, let us 
define a few terms. 

The lattices we work with retain the following information after the first pass: 

• time-alignment of each node; 

• for each link connecting two nodes in the lattice we retain:

    - word identity $w(link)$;

    - acoustic model score: the log-probability of the acoustic segment covered by the
      link given the word, $\log P_{AM}(A(link) | w, link)$; to make this possible, the
      ending nodes of the link must contain all contextual information necessary
      for assigning acoustic model scores; for example, in a crossword triphone
      system, all the words labeling the links leaving the end node must have
      the same first phone;

    - n-gram language model score: the log-probability of the word, $\log P_{NG}(w | link)$;
      again, to make this possible, the start node of the link must contain the
      context (n-1)-gram; it is a state in the finite state machine describing
      the n-gram language model used to generate the lattice; we thus refer to
      lattices as bigram or trigram lattices depending on the order of the
      language model that was used for generating them. The size of the lattice grows
      exponentially fast with the language model order.

The lattice has a unique starting and ending node, respectively. 

A link in the lattice is an arc connecting two nodes of the lattice. Two links are 
considered identical if and only if their word identity is the same and their starting 
and ending nodes are the same, respectively. 

A path p through the lattice is an ordered set of links $l_0 \ldots l_n$ with the constraint
that any two consecutive links cover adjacent time intervals:

$$p = \{l_0 \ldots l_n : \forall i = 0 \ldots n-1, \ ending\_node(l_i) = starting\_node(l_{i+1})\} \qquad (5.5)$$

We will refer to the starting node of $l_0$ as the starting node of path p and to the
ending node of $l_n$ as the ending node of path p.

A partial path is a path whose starting node is the same as the starting node of the 
entire lattice and a complete path is one whose starting/ending nodes are the same 
as those of the entire lattice, respectively. 
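The definitions above map onto a small data structure. The following is a sketch under assumptions: the `Link` class, its field names and the integer node identifiers (standing in for language model state plus time alignment) are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Link:
    start: int                        # starting node
    end: int                          # ending node
    word: str                         # word identity w(link)
    am: float = field(compare=False)  # acoustic model log-probability
    ng: float = field(compare=False)  # n-gram language model log-probability
    # Scores are excluded from comparison: two links are identical if and only if
    # their word identity and their starting and ending nodes are the same.


def is_path(links):
    """Eq. (5.5): each link must start at the node where the previous one ends."""
    return all(a.end == b.start for a, b in zip(links, links[1:]))


def complete_paths(links, start_node, end_node):
    """Enumerate all complete paths from the lattice start node to its end node."""
    outgoing = {}
    for link in links:
        outgoing.setdefault(link.start, []).append(link)

    def extend(node, prefix):
        if node == end_node:
            yield list(prefix)
            return
        for link in outgoing.get(node, []):
            prefix.append(link)
            yield from extend(link.end, prefix)
            prefix.pop()

    yield from extend(start_node, [])


# Toy lattice (hypothetical words and scores) with two alternatives at each position.
links = [
    Link(0, 1, "the", -5.0, -1.0), Link(0, 1, "a", -6.0, -1.5),
    Link(1, 2, "cat", -7.0, -2.0), Link(1, 2, "cap", -7.5, -4.0),
]
paths = list(complete_paths(links, 0, 2))
```

Exhaustive enumeration is only feasible for toy lattices; it is shown here to make the notions of partial and complete path concrete, not as a decoding strategy.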

With the above definitions, a lattice can be conceptually organized as a prefix
tree of paths. When rescoring the lattice using a different language model than the
one that was used in the first pass, we seek to find the complete path $p = l_0 \ldots l_n$
maximizing:

$$f(p) = \sum_{i=0}^{n} \left[ \log P_{AM}(l_i) + LMweight \cdot \log P_{LM}(w(l_i) | w(l_0) \ldots w(l_{i-1})) - \log P_{IP} \right] \qquad (5.6)$$

where:

• $\log P_{AM}(l_i)$ is the acoustic model log-likelihood assigned to link $l_i$;

• $\log P_{LM}(w(l_i) | w(l_0) \ldots w(l_{i-1}))$ is the language model log-probability assigned
to link $l_i$ given the previous links on the partial path $l_0 \ldots l_i$;

• $LMweight > 0$ is a constant weight which multiplies the language model score of
a link; its theoretical justification is unclear but experiments show its usefulness;

• $\log P_{IP} \ge 0$ is the "insertion penalty"; again, its theoretical justification is
unclear but experiments show its usefulness.
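As a concrete reading of the scoring function (5.6), the sketch below accumulates the three terms link by link. The path representation (a list of word and acoustic score pairs), the `lm_logprob` callback and the uniform toy language model are assumptions made for illustration only.

```python
import math


def path_score(path, lm_logprob, lm_weight, log_p_ip):
    """f(p) as in Eq. (5.6): per link, AM score + LMweight * LM score - insertion penalty.

    path:       list of (word, am_logprob) pairs, one per link
    lm_logprob: callable (word, history) -> log P_LM(word | history),
                where history is the tuple of all preceding words on the path
    """
    total = 0.0
    history = ()
    for word, am in path:
        total += am + lm_weight * lm_logprob(word, history) - log_p_ip
        history += (word,)
    return total


# Toy language model (hypothetical): uniform over a 4-word vocabulary.
uniform = lambda word, history: math.log(0.25)
path = [("the", -10.0), ("cat", -12.0)]
score = path_score(path, uniform, lm_weight=16.0, log_p_ip=0.0)
```

Passing the whole history to `lm_logprob` mirrors the fact that the rescoring model may condition on the entire prefix, which is what rules out a plain Viterbi search over the lattice.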

To be able to apply the A* algorithm we need to find an appropriate stack entry
scoring function $g_L(x)$ where x is a partial path and L is the set of complete paths
in the lattice. Going back to the definition (5.1) of $g_L(\cdot)$ we need an overestimate
$g(x.y) = f(x) + h(y|x) \ge f(x.y)$ for all possible complete continuations $y = l_k \ldots l_n$
of x allowed by the lattice. We propose to use the heuristic:

$$h(y|x) = \sum_{i=k}^{n} \left[ \log P_{AM}(l_i) + LMweight \cdot (\log P_{NG}(l_i) + \log P_{COMP}) - \log P_{IP} \right] + LMweight \cdot \log P_{FINAL} \cdot \delta(k < n) \qquad (5.7)$$

A simple calculation shows that if $\log P_{LM}(l_i)$ satisfies:

$$\log P_{NG}(l_i) + \log P_{COMP} \ge \log P_{LM}(l_i), \ \forall l_i$$

then $g_L(x) = f(x) + \max_{y \in C_L(x)} h(y|x)$ is an appropriate choice for the A* stack
entry scoring function.

The justification for the $\log P_{COMP}$ term is that it is supposed to compensate for
the per-word difference in log-probability between the n-gram model NG and the
superior model LM with which we rescore the lattice, hence $\log P_{COMP} \ge 0$. Its
expected value can be estimated from the difference in perplexity between the two
models LM and NG. Theoretically we should use a higher value than the maximum
pointwise difference between the two models:

$$\log P_{COMP} \ge \max_{\forall l_i} \left[ \log P_{LM}(l_i | l_0 \ldots l_{i-1}) - \log P_{NG}(l_i) \right]$$

but in practice we set it by trial and error starting with the expected value as an
initial guess.

The $\log P_{FINAL} \ge 0$ term is used for practical considerations as explained in the
next section.

The calculation of $g_L(x)$ (5.1) is made very efficient after realizing that one can
use the dynamic programming technique in the Viterbi algorithm [29]. Indeed, for
a given lattice L, the value of $h_L(x)$ is completely determined by the identity of the
ending node of x; a Viterbi backward pass over the lattice can store at each node the
corresponding value of $h_L(x) = h_L(ending\_node(x))$ such that it is readily available
in the A* search.
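The backward pass can be sketched as follows. This is a simplified illustration, not the thesis implementation: links are plain tuples, node identifiers are assumed to be integers that increase along every link (so they give a topological order), and the log P_FINAL term of (5.7) is left out, which matches the setting of 0 reported for the WSJ experiments.

```python
def backward_heuristic(links, end_node, lm_weight, log_p_comp, log_p_ip):
    """Viterbi-style backward pass storing h_L at every node.

    links: list of (start, end, am_logprob, ng_logprob) with start < end
    Returns a dict node -> best overestimated suffix score from that node.
    """
    h = {end_node: 0.0}  # empty suffix: h = 0
    # Visit links by decreasing start node so that h[end] is final when it is read.
    for start, end, am, ng in sorted(links, key=lambda l: l[0], reverse=True):
        if end not in h:
            continue  # the lattice end node cannot be reached through this link
        candidate = am + lm_weight * (ng + log_p_comp) - log_p_ip + h[end]
        if candidate > h.get(start, float("-inf")):
            h[start] = candidate
    return h


# Toy lattice (hypothetical scores): nodes 0 -> 1 -> 2 plus a direct link 0 -> 2.
links = [(0, 1, -1.0, -2.0), (0, 2, -4.0, -1.0), (1, 2, -1.0, -1.0)]
h = backward_heuristic(links, end_node=2, lm_weight=1.0, log_p_comp=0.5, log_p_ip=0.0)
```

With such a table the ranking score of a partial path reduces to its own f(x) plus one dictionary lookup at its ending node.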

5.2.2 Some Practical Considerations 

In practice one cannot maintain a potentially infinite stack. We chose to control 
the stack depth using two thresholds: one on the maximum number of entries in 
the stack, called stack-depth-threshold and another one on the maximum log- 
probability difference between the top most and the bottom most hypotheses in the 
stack, called stack-logP-threshold. 
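A minimal sketch of how the two thresholds could be applied to the stack is shown below. The list-of-pairs representation and the function name are assumptions for illustration; the text does not specify how the pruning was implemented.

```python
def prune_stack(entries, depth_threshold, logp_threshold):
    """Apply both stack thresholds to a list of (score, hypothesis) entries.

    Keeps entries whose score is within logp_threshold of the best one,
    then truncates to at most depth_threshold entries, best first.
    """
    ranked = sorted(entries, key=lambda e: e[0], reverse=True)
    if not ranked:
        return ranked
    best = ranked[0][0]
    kept = [e for e in ranked if best - e[0] <= logp_threshold]
    return kept[:depth_threshold]


# Toy stack (hypothetical scores and labels).
stack = [(-10.0, "a"), (-12.0, "b"), (-200.0, "c"), (-11.0, "d")]
narrow = prune_stack(stack, depth_threshold=2, logp_threshold=100.0)
wide = prune_stack(stack, depth_threshold=30, logp_threshold=100.0)
```

Either threshold can be the binding one: in the toy data the depth limit decides the first call and the log-probability window decides the second.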

As glimpsed from the two limiting cases analyzed in Section (5.2), there is a 
clear interaction between the quality of the stack entry scoring function (5.1) and 
the number of hypotheses explored, which in practice has to be controlled by the 
maximum stack size. 

A gross overestimate used in connection with a finite stack may lure the search to 
a cluster of paths which is suboptimal — the desired cluster of paths may fall out of 
the stack if the overestimate happens to favor a wrong cluster. 

Also, longer prefixes, which have shorter suffixes, benefit less from the per-word
$\log P_{COMP}$ compensation, which means that they may fall out of a stack already
full with shorter hypotheses that have high scores due to compensation. This is
the justification for the $\log P_{FINAL}$ term in the compensation function $h(y|x)$: the
variance $var[\log P_{LM}(l_i | l_0 \ldots l_{i-1}) - \log P_{NG}(l_i)]$ is a finite positive quantity, so the
compensation is likely to be closer to the expected value
$E[\log P_{LM}(l_i | l_0 \ldots l_{i-1}) - \log P_{NG}(l_i)]$ for longer y continuations than for shorter ones;
introducing a constant $\log P_{FINAL}$ term is equivalent to an adaptive $\log P_{COMP}$ that
depends on the length of the y suffix: a smaller equivalent $\log P_{COMP}$ for long suffixes y,
for which $E[\log P_{LM}(l_i | l_0 \ldots l_{i-1}) - \log P_{NG}(l_i)]$ is a better estimate for
$\log P_{COMP}$ than it is for shorter ones.

Because the structured language model is computationally expensive, a strong
limitation is being placed on the width of the search, which is controlled by the
stack-depth-threshold and the stack-logP-threshold. For an acceptable search
width (runtime) one seeks to tune the compensation parameters to maximize
performance measured in terms of WER. However, the correlation between these
parameters and the WER is not clear and makes the diagnosis of search problems
extremely difficult. Our method for choosing the search parameters was to sample
a few complete paths $p_1, \ldots, p_N$ from each lattice, rescore those paths according to
the $f(\cdot)$ function (5.6) and then rank the h* path output by the A* search among
the sampled paths. A correct A* search should result in average rank 0. In practice
this doesn't happen but one can trace the topmost path p* in the offending cases,
$p^* \ne h^*$ and $f(p^*) > f(h^*)$:

• if a prefix of the p* hypothesis is still present in the stack when A* returns then 
the search failed strictly because of insufficient compensation; 

• if no prefix of p* is present in the stack then the incorrect search outcome was 
caused by an interaction between compensation and insufficient search width. 
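The rank diagnostic described above can be sketched in a few lines. The representation of paths as opaque labels scored by a lookup, and the function names, are assumptions for illustration.

```python
def rank_among_samples(f, h_star, sampled_paths):
    """Number of sampled paths that score strictly higher than the A* output h_star."""
    target = f(h_star)
    return sum(1 for p in sampled_paths if f(p) > target)


def average_rank(f, results):
    """results: list of (h_star, sampled_paths) pairs, one per lattice."""
    ranks = [rank_among_samples(f, h, paths) for h, paths in results]
    return sum(ranks) / len(ranks)


# Toy data (hypothetical): three sampled paths with known f scores.
scores = {"a": -10.0, "b": -12.0, "c": -9.0}
f = scores.__getitem__
samples = ["a", "b", "c"]
rank_bad = rank_among_samples(f, "a", samples)   # A* returned "a" but "c" scores higher
rank_good = rank_among_samples(f, "c", samples)  # A* returned the best sampled path
avg = average_rank(f, [("a", samples), ("c", samples)])
```

A rank of 0 only shows that no sampled path beats h*; it does not prove that the search was exact, since the samples cover a small part of the lattice.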

The method we chose for sampling paths from the lattice was an N-best search
using the n-gram language model scores; this is appropriate for pragmatic reasons
(one prefers lattice rescoring to N-best list rescoring exactly because of the possibility
to extract a path that is not among the candidates proposed in the N-best list) as
well as practical reasons (they are among the "better" paths in terms of WER).

Chapter 6

Speech Recognition Experiments 

The set of experiments presented in Section 4.1 showed improvement in perplexity 
over the 3-gram language model. The experimental setup is however fairly restrictive 
and artificial when compared to a real world speech recognition task: 

• although the headword percolation and binarization procedure is automatic, 
the treebank used as training data was generated by human annotators; 

• albeit statistically significant, the amount of training data (approximately 1
million words) is small compared to that used for developing language models
used in real world speech recognition experiments;

• the word level tokenization of treebank text is different than that used in the 
speech recognition community, the former being tuned to facilitate linguistic 
analysis. 

In the remaining part of the chapter we will describe the experimental setup used
for speech recognition experiments involving the structured language model, results
and conclusions. The experiments were run on three different corpora, Switchboard
(SWB), Wall Street Journal (WSJ) and Broadcast News (BN), sampling different
points of the speech recognition spectrum: conversational speech over telephone
lines at one end and read grammatical text recorded in ideal acoustic conditions at
the other end.

In order to evaluate our model's potential as part of a speech recognizer, we had
to address the problems outlined above as follows:

• manual vs. automatic parse trees: There are two corpora for which treebanks
exist, although of limited size: Wall Street Journal (WSJ) and Switchboard
(SWB). The UPenn Treebank [21] contains manually parsed WSJ text. There
also exists a small part of Switchboard which was manually parsed at UPenn
(approx. 20,000 words). This allows the training of an automatic parser (we
have used the Collins parser [11] for SWB and the Ratnaparkhi parser [26] for
WSJ and BN), which is then used to generate an automatic treebank,
possibly with a slightly different word tokenization than that of the two manual
treebanks. We evaluated the sensitivity of the structured language model to
this aspect and showed that the reestimation procedure presented in Chapter 3
is powerful enough to overcome any handicap arising from automatic treebanks.

• more training data: The availability of an automatic parser to generate parse
trees for the SLM training data — used for initializing the SLM — opens the 
possibility of training the model on much more data than that used in the ex- 
periments presented in Section 4.1. The only limitations are of computational 
nature, imposed by the speed of the parser used to generate the automatic 
treebank and the efficiency and speed of the reestimation procedure for the 
structured language model parameters. As our experiments show, the reesti- 
mation procedure leads to a better structured model — under both measures of 
perplexity and word error rate^. In practice the speed of the SLM is the limiting 
factor on the amount of training data. For Switchboard we have only 2 million 
words of language modeling training data so this is not an issue; for WSJ we 
were able to accommodate only 20 million words of training data, much less 
than the 40 million words used by standard language models on this task; for 
BN the discrepancy between the baseline 3-gram and the SLM is even bigger, 
we were able to accommodate only 14 million words of training data, much less 
than the 100 million words used by standard language models on this task. 

• different tokenization: We address this problem in the following section.

^Reestimation is also going to smooth out peculiarities in the automatically generated treebank

6.1 Experimental Setup 

In order to train the structured language model (SLM) as described in Chapter 3 
we use parse trees from which to initialize the parameters of the model^ . Fortunately 
a part of the SWB/WSJ data has been manually parsed at UPenn [21], [10]; let us 
refer to this corpus as a Treebank. The training data used for speech recognition — 
CSR — is different from the Treebank in two aspects: 

• the Treebank is only a subset of the usual CSR training data; 

• the Treebank tokenization is different from that of the CSR corpus; among other 
spurious small differences, the most frequent ones are of the type presented in 
Table 6.1. 



    Treebank    CSR
    do n't      don't
    it 's       it's
    jones '     jones'
    i 'm        i'm
    i 'll       i'll
    i 'd        i'd
    we 've      we've
    you 're     you're

Table 6.1: Treebank vs. CSR tokenization mismatch



Our goal is to train the SLM on the CSR corpus. 

Training Setup 

The training of the SLM proceeds as follows:

• Process the CSR training data to bring it closer to the Treebank format. We
applied the transformations suggested by Table 6.1; the resulting corpus will be
called CSR-Treebank, although at this stage we only have words and no parse
trees for it;

• Transfer the syntactic knowledge from the Treebank onto the CSR-Treebank
training corpus; as a result of this stage, CSR-Treebank is truly a "treebank"
containing binarized and headword annotated trees:

    - for the SWB experiments we parsed the SWB-CSR-Treebank corpus using
      the SLM trained on the SWB-Treebank, thus using the SLM as a parser;
      the vocabulary for this step was the union between the SWB-Treebank
      and the SWB-CSR-Treebank closed vocabularies. The resulting trees are
      already binary and have headword annotation.

    - for the WSJ and BN experiments we parsed the WSJ-CSR-Treebank
      corpus using the Ratnaparkhi maximum entropy parser [26], trained on the
      UPenn Treebank data^. The resulting trees were binarized and annotated
      with headwords using the procedure described in Section 2.1.1.

• Apply the SLM parameter reestimation procedure on the CSR-Treebank
training corpus using the parse trees obtained at the previous step for gathering
initial statistics.

^The use of initial statistics gathered in a different way is an interesting direction of research; the
convergence properties of the reestimation procedure become essential in such a situation

Notice that we have avoided "transferring" the syntactic knowledge from the Tree- 
bank tokenization directly onto the CSR tokenization; the reason is that CSR word 
tokens like "he's" or "you're" cross boundaries of syntactic constituents in the Tree- 
bank corpus and the transfer of parse trees from the Treebank to the CSR corpus is 
far from obvious and likely to violate syntactic knowledge present in the treebank. 
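The token-level transformation suggested by Table 6.1, and its inverse, can be sketched as below. The regular expression and the function names are illustrative assumptions built only from the rows of Table 6.1; the actual processing scripts are not described in the text and may have handled more cases.

```python
import re

# Clitic suffixes taken from the rows of Table 6.1 (the pattern itself is an assumption).
_CLITIC = re.compile(r"^(.+?)(n't|'s|'m|'ll|'d|'ve|'re|')$")


def csr_to_treebank(tokens):
    """Split CSR tokens such as "don't" into Treebank-style pairs such as "do", "n't"."""
    out = []
    for tok in tokens:
        m = _CLITIC.match(tok)
        if m:
            out.extend([m.group(1), m.group(2)])
        else:
            out.append(tok)
    return out


def treebank_to_csr(tokens):
    """Undo the split by gluing each clitic back onto the preceding token."""
    out = []
    for tok in tokens:
        if out and (tok == "n't" or tok.startswith("'")):
            out[-1] += tok
        else:
            out.append(tok)
    return out


csr = "i don't think it's jones' fault".split()
trbnk = csr_to_treebank(csr)
```

The inverse mapping matters because, as described later in this section, the WER is measured only after the transformations have been undone.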

^The parser is mismatched, the most important difference being the fact that in the training data
of the parser numbers are written as "$123" whereas in the data to be parsed they are expanded
to "one hundred twenty three dollars"; we rely on the SLM parameter reestimation procedure to
smooth out this mismatch






Lattice Decoding Setup 

To be able to run lattice decoding experiments we need to bring the lattices, which
are in CSR tokenization, to the CSR-Treebank format. The only operation involved in this
transformation is splitting certain words into two parts, as suggested by Table 6.1.
Each link whose word needs to be split is cut into two parts and an intermediate
node is inserted into the lattice as shown in Figure 6.1. The acoustic and language
model scores of the initial link are copied onto the second new link. For all the
decoding experiments we have carried out, the WER is measured after undoing the
transformations highlighted above; the reference transcriptions for the test data were
not touched and the NIST SCLITE^ package was used for measuring the WER.

Figure 6.1: Lattice CSR to CSR-Treebank Processing
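The link-splitting step can be illustrated with a short sketch. The dictionary representation of a link is hypothetical, and giving the first new link zero log-scores is an assumption of this sketch (chosen so that the score of any path through the pair is unchanged); the text only states that the original scores are copied onto the second new link.

```python
def split_link(link, new_node, first_word, second_word):
    """Split one lattice link into two links joined at an intermediate node.

    link: dict with keys start, end, word, am, ng (log-scores)
    The original AM and n-gram scores go onto the second new link.
    ASSUMPTION: the first new link carries zero log-scores.
    """
    first = {"start": link["start"], "end": new_node,
             "word": first_word, "am": 0.0, "ng": 0.0}
    second = {"start": new_node, "end": link["end"],
              "word": second_word, "am": link["am"], "ng": link["ng"]}
    return first, second


# Toy link (hypothetical node ids and scores) carrying the CSR token "don't".
original = {"start": 3, "end": 4, "word": "don't", "am": -42.0, "ng": -3.5}
first, second = split_link(original, new_node=99, first_word="do", second_word="n't")
```

The time alignment of the inserted node is not modeled here; in a real lattice it would have to be set to some value between the start and end times of the original link.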

The refinement of the SLM presented in Section 2.6, Eq. (2.12)-(2.13), was not used
at all during the following experiments due to its low ratio of improvement versus
computational cost.

6.2 Perplexity Results 

As a first step we evaluated the perplexity performance of the SLM relative to that 
of a deleted interpolation 3-gram model trained in the same conditions. As outlined 
in the previous section, we worked on the CSR-Treebank corpus. 

^SCLITE is a standard program supplied by NIST for scoring speech recognizers

6.2.1 Wall Street Journal Perplexity Results

We chose to work on the DARPA'93 evaluation HUB1 test setup. The size of the
test set is 213 utterances, 3446 words. The 20kwds open vocabulary and baseline
3-gram model are the standard ones provided by NIST and LDC.

As a first step we evaluated the perplexity performance of the SLM relative to
that of a deleted interpolation 3-gram model trained under the same conditions:
training data size 20Mwds (a subset of the training data used for the baseline 3-gram
model), standard HUB1 open vocabulary of size 20kwds; both the training data and
the vocabulary were re-tokenized such that they conform to the UPenn Treebank
tokenization. We have linearly interpolated the SLM with the above 3-gram model:

$$P(\cdot) = \lambda \cdot P_{3gram}(\cdot) + (1 - \lambda) \cdot P_{SLM}(\cdot)$$

showing a 10% relative reduction over the perplexity of the 3-gram model. The results
are presented in Table 6.2. The SLM parameter reestimation procedure^ reduces the
PPL by 5% (2% after interpolation with the 3-gram model). The main reduction
in PPL comes however from the interpolation with the 3-gram model, showing that
although overlapping, the two models successfully complement each other. The
interpolation weight was determined on a held-out set to be $\lambda = 0.4$. In this experiment
both language models operate in the UPenn Treebank text tokenization.
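The effect of linear interpolation on perplexity can be illustrated with a toy computation. The per-word probabilities below are invented for the example and have nothing to do with the reported results; they only show how two models that err on different words can combine into a lower perplexity than either one alone.

```python
import math


def interpolate(p_3gram, p_slm, lam):
    """Linear interpolation: lam * P_3gram + (1 - lam) * P_SLM."""
    return lam * p_3gram + (1.0 - lam) * p_slm


def perplexity(word_probs):
    """exp of the negative average log-probability over the test words."""
    log_sum = sum(math.log(p) for p in word_probs)
    return math.exp(-log_sum / len(word_probs))


# Hypothetical per-word probabilities on a 2-word test text:
# each model is confident on the word where the other one is not.
p_3gram = [0.5, 0.01]
p_slm = [0.01, 0.5]
lam = 0.4
p_mix = [interpolate(a, b, lam) for a, b in zip(p_3gram, p_slm)]

ppl_3gram = perplexity(p_3gram)
ppl_slm = perplexity(p_slm)
ppl_mix = perplexity(p_mix)
```

Because the interpolation is done on probabilities and not on log-probabilities, the mixture remains a proper distribution whenever both components are.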



                                       L2R Perplexity
    Language Model                     DEV set    TEST set (no int)    TEST set (3-gram int)
    Trigram                            33.0       147.8                147.8
    SLM; Initial stats (iteration 0)   39.1       151.9                135.9
    SLM; Reestimated (iteration 1)     34.6       144.1                132.8

Table 6.2: WSJ-CSR-Treebank perplexity results



^Due to the fact that the parameter reestimation procedure for the SLM is computationally
expensive we ran only a single iteration

6.2.2 Switchboard Perplexity Results

For the Switchboard experiments the size of the training data was 2.29 Mwds; the
size of the test data set aside for perplexity measurements was 28 Kwds (WS97
DevTest [10]). We used a closed vocabulary of size 22Kwds. Again, we have also
linearly interpolated the SLM with the deleted interpolation 3-gram baseline, showing
a modest reduction in perplexity:

$$P(w_i | W_{i-1}) = \lambda \cdot P_{3gram}(w_i | w_{i-1}, w_{i-2}) + (1 - \lambda) \cdot P_{SLM}(w_i | W_{i-1})$$

The interpolation weight was determined on a held-out set to be $\lambda = 0.4$. The results
are presented in Table 6.3.



                                       L2R Perplexity
    Language Model                     DEV set    TEST set (no int)    TEST set (3-gram int)
    Trigram                            22.53      68.56                68.56
    SLM; Seeded with Auto-Treebank     23.94      72.09                65.80
    SLM; Reestimated (iteration 1)     22.70      71.01                65.35

Table 6.3: SWB-CSR-Treebank perplexity results



6.2.3 Broadcast News Perplexity Results 

For the Broadcast News experiments the size of the training data was 14 Mwds;
the size of the test data set aside for perplexity measurements was 23150 wds
(DARPA'96 HUB4 dev-test). We used an open vocabulary of size 61Kwds. Again, we
have also linearly interpolated the SLM with the deleted interpolation 3-gram baseline
built on exactly the same training data, showing an overall 7% relative reduction in
perplexity:

$$P(w_i | W_{i-1}) = \lambda \cdot P_{3gram}(w_i | w_{i-1}, w_{i-2}) + (1 - \lambda) \cdot P_{SLM}(w_i | W_{i-1})$$

The interpolation weight was determined on a held-out set to be $\lambda = 0.4$. The results
are presented in Table 6.4.

                                       L2R Perplexity
    Language Model                     DEV set    TEST set (no int)    TEST set (3-gram int)
    Trigram                            35.4       217.8                217.8
    SLM; Seeded with Auto-Treebank     57.7       231.6                205.5
    SLM; Reestimated (iteration 2)     40.1       221.7                202.4

Table 6.4: BN-CSR-Treebank perplexity results



6.3 Lattice Decoding Results 

We proceeded to evaluate the WER performance of the SLM using the A* lattice
decoder described in Chapter 5. Before describing the experiments we need to make
one point clear: there are two language model scores associated with each link in the
lattice:

• the language model score assigned by the model that generated the lattice, 
referred to as the LAT3-gram; this model operates on text in the CSR tokeniza- 
tion; 

• the language model score assigned by rescoring each link in the lattice with the 
deleted interpolation 3-gram built on the data in the CSR-Treebank tokeniza- 
tion, referred to as the TRBNK3-gram; 

6.3.1 Wall Street Journal Lattice Decoding Results 

The lattices on which we ran rescoring experiments were obtained using the
standard 20k (open) vocabulary language model (LAT3-gram) trained on more training
data than the SLM, about 40Mwds. The deleted interpolation 3-gram model
(TRBNK3-gram), built on much less training data (20Mwds, same as the SLM) and
using the same standard open vocabulary after re-tokenizing it to match
the UPenn Treebank text tokenization, is weaker than the one used for generating
the lattices, as confirmed by our experiments. Consequently, we ran lattice rescoring
experiments in two setups:

• using the language model that generated the lattice (LAT3-gram) as the
baseline model; language model scores are available in the lattice.

• using the TRBNK3-gram language model (same training conditions as the
SLM); we had to assign new language model scores to each link in the lattice.

The 3-gram lattices we used have an "oracle" WER^ of 3.4%; the baseline WER
is 13.7%, obtained using the standard 3-gram model provided by DARPA (dubbed
LAT3-gram), trained on 40Mwds and using a 20k open vocabulary.

Comparison between LAT3-gram and TRBNK3-gram

A first batch of experiments evaluated the power of the two 3-gram models at
our disposal. The LAT3-gram scores are available in the lattice from the first pass
and we can rescore each link in the lattice using the TRBNK3-gram model. The
Viterbi algorithm can be used to find the best path through the lattice according
to the scoring function (5.6) where $\log P_{LM}(\cdot)$ can be either of the above or a linear
combination of the two. Notice that the linear interpolation of link language model
scores:

$$P(l) = \lambda \cdot P_{\text{LAT3-gram}}(l) + (1 - \lambda) \cdot P_{\text{TRBNK3-gram}}(l)$$

doesn't lead to a proper probabilistic model due to the tokenization mismatch. In
order to correct this problem we adjust the workings of the TRBNK3-gram to take
two steps whenever a split link is encountered and interpolate with the correct
LAT3-gram probability for the two links. For example:

$$P(\text{don't} | x, y) = \lambda \cdot P_{\text{LAT3-gram}}(\text{don't} | x, y) + (1 - \lambda) \cdot P_{\text{TRBNK3-gram}}(\text{do} | x, y) \cdot P_{\text{TRBNK3-gram}}(\text{n't} | y, \text{do}) \qquad (6.1)$$

The results are shown in Table 6.5. The parameters in (5.6) were set to: LMweight = 16,
$\log P_{IP} = 0$, usual values for WSJ.
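Equation (6.1) can be written as a one-line function. The probabilities in the example are made-up numbers used only to exercise the formula, and the function name is hypothetical.

```python
import math


def interpolated_split_prob(lam, p_lat, p_trbnk_first, p_trbnk_second):
    """Eq. (6.1): interpolate the LAT3-gram probability of the unsplit word with the
    product of the two TRBNK3-gram probabilities of its Treebank-style halves."""
    return lam * p_lat + (1.0 - lam) * p_trbnk_first * p_trbnk_second


# Hypothetical values for P_LAT(don't|x,y), P_TRBNK(do|x,y) and P_TRBNK(n't|y,do).
p = interpolated_split_prob(lam=0.4, p_lat=0.02, p_trbnk_first=0.05, p_trbnk_second=0.5)
```

Multiplying the two TRBNK3-gram factors first puts both models on the same event, the whole CSR word, before they are mixed.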

^The "oracle" WER is calculated by finding the path with the least number of errors in each
lattice

    λ         0.0     0.2     0.4     0.6     0.8     1.0
    WER(%)    14.7    14.2    13.8    13.7    13.5    13.7

Table 6.5: 3-gram Language Model; Viterbi Decoding Results

LAT3-gram driven search using the SLM

A second batch of experiments evaluated the performance of the SLM. The
perplexity results show that interpolation with the 3-gram model is beneficial for our
model. The previous experiments show that the LAT3-gram model is more powerful
than the TRBNK3-gram model. The interpolated language model score:

$$P(l) = \lambda \cdot P_{\text{LAT3-gram}}(l) + (1 - \lambda) \cdot P_{SLM}(l)$$

is calculated as explained in the previous section (see Eq. (6.1)).

The results for different interpolation coefficient values are shown in Table 6.6.
The parameters controlling the SLM were the same as in Chapter 3.

As explained previously, due to the fact that the SLM's memory extends over
the entire prefix we need to apply the A* algorithm to find the overall best path
in the lattice. The parameters controlling the A* search were set to: $\log P_{COMP} = 0.5$,
$\log P_{FINAL} = 0$, LMweight = 16, $\log P_{IP} = 0$, stack-depth-threshold = 30,
stack-logP-threshold = 100 (see (5.6) and (5.7)).

The $\log P_{COMP}$, $\log P_{FINAL}$, stack-depth-threshold and
stack-logP-threshold parameters were optimized directly on test data for the best
interpolation value found in the perplexity experiments. The LMweight and $\log P_{IP}$
parameters are the ones typically used with the 3-gram model for the WSJ task; we did
not adjust them to try to fit the SLM better.



$\lambda$                   0.0    0.4    1.0
WER(%) (iteration 0 SLM)   14.4   13.0   13.7
WER(%) (iteration 1 SLM)   14.3   13.2   13.7

Table 6.6: LAT-3gram + Structured Language Model; A* Decoding Results
The structured language model achieved an absolute improvement in WER of
0.7% (5% relative) over the baseline.
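The absolute and relative figures quoted here follow directly from the table entries. A small sketch of the arithmetic (the helper name is hypothetical):

```python
def wer_reduction(baseline_wer, new_wer):
    """Return (absolute, relative) WER reduction, both in percent."""
    absolute = baseline_wer - new_wer
    relative = 100.0 * absolute / baseline_wer
    return absolute, relative

# LAT3-gram baseline (lambda = 1.0) versus the best interpolated SLM result
# reported for WSJ: 13.7% and 13.0%.
absolute, relative = wer_reduction(13.7, 13.0)
```

With these inputs the relative reduction comes out slightly above 5%, which the text rounds to 5%.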

TRBNK3-gram driven search using the SLM 

We rescored each link in the lattice using the TRBNK3-gram language model and
used this as a baseline for further experiments. As shown in Table 6.5, the baseline
WER becomes 14.7%. The relevance of the experiments using the TRBNK3-gram
rescored lattices is somewhat questionable since the lattice was generated using a
much stronger language model, the LAT3-gram. Our point of view is the following:
assume that we have a set of hypotheses which were produced in some way; we then
rescore them using two language models, M1 and M2; if model M2 is truly superior
to M1^, then the WER obtained by rescoring the set of hypotheses using model M2
should be lower than that obtained using model M1.

We repeated the experiment in which we linearly interpolate the SLM with the 
3-gram language model: 

$$P(l) = \lambda \cdot P_{\text{TRBNK3-gram}}(l) + (1 - \lambda) \cdot P_{\text{SLM}}(l)$$

for different interpolation coefficients. The A* search parameters were the same as
before. The results are presented in Table 6.7. The structured language model
interpolated with the trigram model achieves 0.9% absolute (6% relative) reduction over
the trigram baseline; the parameters controlling the A* search have not been tuned
for this set of experiments.

$\lambda$                   0.0    0.4    1.0
WER(%) (iteration 0 SLM)   14.6   14.3   14.7
WER(%) (iteration 3 SLM)   13.8   14.3   14.7

Table 6.7: TRBNK-3gram + Structured Language Model; A* Decoding Results

^From a speech recognition perspective.
6.3.2 Switchboard Lattice Decoding Results

On the Switchboard corpus, the lattices for which we ran decoding experiments
were obtained using a language model (LAT3-gram) trained in very similar conditions
(roughly the same training data size and vocabulary, closed over test data) to the ones
under which the SLM and the baseline deleted interpolation 3-gram model
(TRBNK3-gram) were trained. The only difference is the tokenization (CSR vs.
CSR-Treebank, see Section 6.1), which makes the LAT3-gram act as a phrase-based
language model when compared to TRBNK3-gram. The experiments confirmed that
LAT3-gram is stronger than TRBNK3-gram.

Again, we ran lattice rescoring experiments in two setups: 

• using the language model that generated the lattice (LAT3-gram) as the
baseline model; language model scores are available in the lattice.

• using the TRBNK3-gram language model — same training conditions as the 
SLM; we had to assign new language model scores to each link in the lattice. 

Comparison between LAT3-gram and TRBNK3-gram

The results are shown in Table 6.8, for different interpolation values:

$$P(l) = \lambda \cdot P_{\text{LAT3-gram}}(l) + (1 - \lambda) \cdot P_{\text{TRBNK3-gram}}(l)$$

The parameters in (5.6) were set to: LMweight = 12, $\log P_{IP}$ = 10.



$\lambda$    0.0    0.2    0.4    0.6    0.8    1.0
WER(%)      42.3   41.8   41.2   41.0   41.0   41.2

Table 6.8: 3-gram Language Model; Viterbi Decoding Results



LAT3-gram driven search using the SLM 

The previous experiments show that the LAT3-gram model is more powerful than
the TRBNK3-gram model. We thus wish to interpolate the SLM with the LAT3-gram
model:

$$P(l) = \lambda \cdot P_{\text{LAT3-gram}}(l) + (1 - \lambda) \cdot P_{\text{SLM}}(l)$$

We correct the interpolation the same way as described in the WSJ experiments
(see Section 6.3.1, Eq. 6.1).

The parameters controlling the SLM were the same as in chapter 3. The parameters
controlling the A* search were set to: $\log P_{COMP}$ = 0.5, $\log P_{FINAL}$ = 0, LMweight
= 12, $\log P_{IP}$ = 10, stack-depth-threshold=40, stack-depth-logP-threshold=100
(see 5.6 and 5.7). The $\log P_{COMP}$, $\log P_{FINAL}$, stack-depth-threshold and
stack-depth-logP-threshold parameters were optimized directly on test data for the
best interpolation value found in the perplexity experiments. In all other experiments
they were kept fixed to these values. The LMweight and $\log P_{IP}$ parameters are the ones
typically used with the 3-gram model for the Switchboard task; we did not adjust
them to try to fit the SLM better.

The results for different interpolation coefficient values are shown in Table 6.9. 



$\lambda$                   0.0    0.4    1.0
WER(%) (SLM iteration 0)   41.8   40.7   41.2
WER(%) (SLM iteration 3)   41.6   40.5   41.2

Table 6.9: LAT-3gram + Structured Language Model; A* Decoding Results

The structured language model achieved an absolute improvement of 0.7% WER
over the baseline; the improvement is statistically significant at the 0.001 level
according to a sign test at the sentence level.
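The text does not spell out the exact form of the sign test. The sketch below assumes the standard two-sided binomial version, in which each sentence where the two systems differ counts as a win for one of them and ties are discarded. The counts used are hypothetical and are not the ones from the experiment.

```python
from math import comb

def sign_test_p_value(wins, losses):
    """Two-sided sign test p-value; ties are assumed to have been discarded."""
    n = wins + losses
    k = min(wins, losses)
    # Probability of seeing at most k of the rarer outcome under a fair coin.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)

# Hypothetical counts: sentences on which each system made fewer errors.
p_small = sign_test_p_value(8, 2)
p_large = sign_test_p_value(20, 2)
```

Under these assumptions a result is declared significant at the 0.001 level when the returned value is below 0.001.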

For tuning the search parameters we have applied the N-best lattice sampling
technique described in section 5.2.2. As a by-product, the WER performance of the
structured language model on N-best list rescoring (N = 25) was 40.4%. The
average rank of the hypothesis found by the A* search among the N-best ones, after
rescoring them using the structured language model interpolated with the trigram,
was 0.3. There were 329 offending sentences, out of a total of 2427 sentences, in
which the A* search led to a hypothesis whose score was lower than that of the top
hypothesis among the N-best (0-best). In 296 cases the prefix of the rescored 0-best
was still in the stack when A* returned (inadequate compensation) and in the
other 33 cases, the 0-best hypothesis was lost during the search due to the finite stack
size.

TRBNK3-gram driven search using the SLM 

We rescored each link in the lattice using the TRBNK3-gram language model and
used this as a baseline for further experiments. As shown in Table 6.8, the baseline
WER is 42.3%.

We then repeated the experiment in which we linearly interpolate the SLM with 
the 3-gram language model: 

$$P(l) = \lambda \cdot P_{\text{TRBNK3-gram}}(l) + (1 - \lambda) \cdot P_{\text{SLM}}(l)$$

for different interpolation coefficients. The parameters controlling the A* search
were set to: $\log P_{COMP}$ = 0.5, $\log P_{FINAL}$ = 0, LMweight = 12, $\log P_{IP}$ = 10,
stack-depth-threshold=40, stack-depth-logP-threshold=100 (see 5.6 and 5.7).
The results are presented in Table 6.10. The structured language model interpolated
with the trigram model achieves 0.7% absolute reduction over the trigram baseline.

$\lambda$                   0.0    0.4    1.0
WER(%) (iteration 0 SLM)   42.0   41.6   42.3
WER(%) (iteration 3 SLM)   42.0   41.6   42.3

Table 6.10: TRBNK-3gram + Structured Language Model; A* Decoding Results

6.3.3 Broadcast News Lattice Decoding Results 

The Broadcast News (BN) lattices for which we ran decoding experiments were
obtained using a language model (LAT3-gram) trained on much more training data
than the SLM; a typical figure for BN is 100Mwds. We could accommodate 14Mwds
of training data for the SLM and the baseline deleted interpolation 3-gram model
(TRBNK3-gram). The experiments confirmed that LAT3-gram is stronger than
TRBNK3-gram.

The test set on which we ran the experiments was the DARPA'96 HUB4 dev-test.
We used an open vocabulary of 61kwds. Again, we ran lattice rescoring experiments
in two setups:

• using the language model that generated the lattice (LAT3-gram) as the
baseline model; language model scores are available in the lattice.

• using the TRBNK3-gram language model — same training conditions as the 
SLM; we had to assign new language model scores to each link in the lattice. 

The test set is segmented in different focus conditions summarized in Table 6.11. 



Focus   Description
F0      baseline broadcast speech (clean, planned)
F1      spontaneous broadcast speech (clean)
F2      low fidelity speech (typically narrowband)
F3      speech in the presence of background music
F4      speech under degraded acoustical conditions
F5      non-native speakers (clean, planned)
FX      all other speech (e.g. spontaneous non-native)

Table 6.11: Broadcast News Focus conditions



Comparison between LAT3-gram and TRBNK3-gram

The results are shown in Table 6.12, for different interpolation values:

$$P(l) = \lambda \cdot P_{\text{LAT3-gram}}(l) + (1 - \lambda) \cdot P_{\text{TRBNK3-gram}}(l)$$

The parameters in (5.6) were set to: LMweight = 13, $\log P_{IP}$ = 10.



$\lambda$    0.0    0.2    0.4    0.6    0.8    1.0
WER(%)      35.2   34.0   33.2   33.0   32.9   33.1

Table 6.12: 3-gram Language Model; Viterbi Decoding Results
LAT3-gram driven search using the SLM

The previous experiments show that the LAT3-gram model is more powerful than
the TRBNK3-gram model. We thus wish to interpolate the SLM with the LAT3-gram
model:

$$P(l) = \lambda \cdot P_{\text{LAT3-gram}}(l) + (1 - \lambda) \cdot P_{\text{SLM}}(l)$$

We correct the interpolation the same way as described in the WSJ experiments
(see Section 6.3.1, Eq. 6.1).

The parameters controlling the SLM were the same as in chapter 3. The parameters
controlling the A* search were set to: $\log P_{COMP}$ = 0.5, $\log P_{FINAL}$ = 0, LMweight
= 13, $\log P_{IP}$ = 10, stack-depth-threshold=25, stack-depth-logP-threshold=100
(see 5.6 and 5.7).

The results for different interpolation coefficient values are shown in Table 6.13. 
The breakdown on different focus conditions is shown in Table 6.14. The SLM achieves 



$\lambda$                   0.0    0.4    1.0
WER(%) (SLM iteration 0)   34.4   33.0   33.1
WER(%) (SLM iteration 2)   35.1   33.0   33.1

Table 6.13: LAT-3gram + Structured Language Model; A* Decoding Results



$\lambda$  Decoder  SLM iteration   F0    F1    F2    F3    F4    F5    FX    overall
1.0        Viterbi  -              13.0  30.8  42.1  31.0  22.8  52.3  53.9  33.1
0.0        A*       0              13.3  31.7  44.5  32.0  25.1  54.4  54.8  34.4
0.4        A*       0              12.5  30.5  42.2  31.0  23.0  52.9  53.9  33.0
1.0        A*       0              12.9  30.7  42.1  31.0  22.8  52.3  53.9  33.1
0.0        A*       2              14.8  31.7  46.3  31.6  27.5  54.3  54.8  35.1
0.4        A*       2              12.2  30.7  42.0  31.1  22.5  53.1  54.1  33.0
1.0        A*       2              12.9  30.7  42.1  31.0  22.8  52.3  53.9  33.1

Table 6.14: LAT-3gram + Structured Language Model; A* Decoding Results;
breakdown on different focus conditions



0.8% absolute (6% relative) reduction in WER on the F0 focus condition despite the
fact that the overall WER reduction is negligible. We also note the beneficial effect
training has on the SLM performance on the F0 focus condition.

TRBNK3-gram driven search using the SLM 

We rescored each link in the lattice using the TRBNK3-gram language model and
used this as a baseline for further experiments. As shown in Table 6.12, the baseline
WER is 35.2%.

We then repeated the experiment in which we linearly interpolate the SLM with 
the 3-gram language model: 

$$P(l) = \lambda \cdot P_{\text{TRBNK3-gram}}(l) + (1 - \lambda) \cdot P_{\text{SLM}}(l)$$

for different interpolation coefficients. The parameters controlling the A* search
were set to: $\log P_{COMP}$ = 0.5, $\log P_{FINAL}$ = 0, LMweight = 13, $\log P_{IP}$ = 10,
stack-depth-threshold=25, stack-depth-logP-threshold=100 (see 5.6 and 5.7).
The results are presented in Table 6.15. The breakdown on different focus conditions
is shown in Table 6.16. The SLM achieves 1.1% absolute (8% relative) reduction in
WER on the F0 focus condition and an overall WER reduction of 0.5% absolute. We
also note the beneficial effect training has on the SLM performance.

$\lambda$                   0.0    0.4    1.0
WER(%) (SLM iteration 0)   35.4   34.9   35.2
WER(%) (SLM iteration 2)   35.0   34.7   35.2

Table 6.15: TRBNK-3gram + Structured Language Model; A* Decoding Results

Conclusions to Lattice Decoding Experiments 

We note that the parameter reestimation doesn't improve the WER performance 
of the model in all cases. The SLM achieves an improvement over the 3-gram baseline 
on all three corpora: Wall Street Journal, Switchboard and Broadcast News. 
$\lambda$  Decoder  SLM iteration   F0    F1    F2    F3    F4    F5    FX    overall
1.0        Viterbi  -              14.5  32.5  44.9  33.3  25.7  54.9  56.1  35.2
0.0        A*       0              14.6  32.9  44.6  33.1  26.3  54.4  56.9  35.4
0.4        A*       0              14.1  32.2  44.4  33.0  25.0  54.2  56.1  34.9
1.0        A*       0              14.5  32.4  44.9  33.3  25.7  54.9  56.1  35.2
0.0        A*       2              13.7  32.4  44.7  32.9  26.1  54.3  56.3  35.0
0.4        A*       2              13.4  32.2  44.1  31.9  25.3  54.2  56.2  34.7
1.0        A*       2              14.5  32.4  44.9  33.3  25.7  54.9  56.1  35.2

Table 6.16: TRBNK-3gram + Structured Language Model; A* Decoding Results;
breakdown on different focus conditions



6.3.4 Taking Advantage of Lattice Structure 

As we shall see, in order to carry out experiments in which we try to take further 
advantage of the lattice, we need to have proper language model scores on each lattice 
link. For all the experiments in this section we used the TRBNK3-gram rescored 
lattices. 

Peeking Interpolation 

As described in Section 2.6, the probability assignment for the word at position
$k + 1$ in the input sentence is made using:

$$P(w_{k+1}/W_k) = \sum_{T_k \in S_k} P(w_{k+1}/W_k T_k) \cdot \rho(W_k, T_k) \qquad (6.2)$$

where

$$\rho(W_k, T_k) = P(W_k T_k) / \sum_{T_k \in S_k} P(W_k T_k) \qquad (6.3)$$

which ensures a proper probability over strings $W^*$, where $S_k$ is the set of all parses
present in the SLM stacks at the current stage $k$.
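Equations (6.2) and (6.3) amount to a weighted average of the per-parse word predictions, with weights proportional to the joint probability of each surviving parse. A minimal sketch follows, assuming each parse in the stacks is represented as a pair (joint probability, word predictor dictionary); this representation and all numbers are hypothetical.

```python
def next_word_probability(word, parses):
    """Interpolate the word prediction over the parses in the stacks.

    parses -- list of (joint, predictor) pairs, where joint stands for
              P(W_k T_k) and predictor maps a word w to P(w / W_k T_k).
    """
    total = sum(joint for joint, _ in parses)           # denominator of (6.3)
    return sum((joint / total) * predictor.get(word, 0.0)
               for joint, predictor in parses)          # sum in (6.2)

# Two hypothetical parses over a two-word vocabulary.
parses = [(0.03, {"a": 0.2, "b": 0.8}),
          (0.01, {"a": 0.6, "b": 0.4})]
p_a = next_word_probability("a", parses)
p_b = next_word_probability("b", parses)
```

Because the weights sum to one and each predictor is a distribution, the interpolated values again sum to one over the vocabulary.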

One way to take advantage of the lattice is to determine the set of parses over
which we are going to interpolate by knowing what the possible future words are:
the links leaving the end node of a given path in the lattice bear only a small set of
words (for our lattices, less than 10 on the average). The idea is that by knowing
the future word it is much easier to determine the most favorable parse for predicting
it. Let $\mathcal{W}_L(p)$ denote the set of words that label the links leaving the end node of path
$p$ in lattice $L$. We can then restrict the set of parses $S_k$ used for interpolation to:

$$S_k^{pruned} = \{T_k^i : T_k^i = \arg\max_{T_k \in S_k} P(w_i/W_k T_k) \cdot \rho(W_k, T_k), \ \forall w_i \in \mathcal{W}_L(p)\}$$

We obviously have $S_k^{pruned} \subset S_k$. Notice that this does not lead to a correct
probability assignment anymore since it violates the causality implied by the left-to-right
operation of the language model. In the extreme case of $|\mathcal{W}_L(p)| = 1$ we have a
model which, at each next word prediction step, picks from among the parses in
$S_k$ only the most favorable one for predicting the next word. This leads to the
undesirable effect that at a subsequent prediction during the same sentence the parse picked
may change, always trying to make the best possible current prediction. In order to
compensate for this unwanted effect we decided to run a second experiment in which
only the parses in $S_k^{pruned}$ are kept in the stacks of the structured language model at
position $k$ in the input sentence; the other ones are discarded and thus unavailable
for later predictions in the sentence. This speeds up the decoder considerably
(approximately 4 times faster than the previous experiment) and slightly improves
on the results in the previous experiment but still does not increase the performance
over the standard structured language model, as shown in Table 6.17. The results for
the standard SLM do not match those in Table 6.10 due to the fact that in this case
we have not applied the tokenization correction specified in Eq. (6.1), Section 6.3.1.
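The restriction of the parse set can be sketched as follows: for every word on a link leaving the current lattice node, keep the one parse that predicts it best. The data layout (a list of (joint probability, predictor) pairs) and the example values are assumptions made for illustration only.

```python
def pruned_parse_indices(parses, future_words):
    """Indices of the parses kept by the peeking restriction.

    parses       -- list of (joint, predictor) pairs: joint stands for
                    P(W_k T_k), predictor maps w to P(w / W_k T_k)
    future_words -- words labelling the links that leave the end node
    """
    total = sum(joint for joint, _ in parses)
    kept = set()
    for word in future_words:
        # Parse maximizing P(w / W_k T_k) * rho(W_k, T_k) for this word.
        best = max(range(len(parses)),
                   key=lambda i: parses[i][1].get(word, 0.0)
                   * parses[i][0] / total)
        kept.add(best)
    return kept

# Three hypothetical parses; the third is never the best predictor.
parses = [(0.03, {"a": 0.2, "b": 0.8}),
          (0.01, {"a": 0.9, "b": 0.1}),
          (0.01, {"a": 0.5, "b": 0.5})]
kept_two_words = pruned_parse_indices(parses, ["a", "b"])
kept_one_word = pruned_parse_indices(parses, ["b"])
```

With a single future word only one parse survives, which is the extreme case discussed in the text.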



$\lambda$                        0.0    0.2    0.4    0.6    0.8    1.0
WER(%) (standard SLM)           42.0   41.8   41.9   41.5   42.1   42.5
WER(%) (peeking SLM)            42.3    -      -     42.0    -      -
WER(%) (pruned peeking SLM)     42.1    -      -     41.9    -      -

Table 6.17: Switchboard; TRBNK-3gram + Peeking SLM
Normalized Peeking

Another proper probability assignment for the next word $w_{k+1}$ could be made
according to:

$$P(w_{k+1}/W_k) = norm(\alpha(w_{k+1}, W_k)) \qquad (6.4)$$

where

$$\alpha(w, W_k) = \max_{T_k \in S_k} P(w/W_k T_k) \cdot \rho(W_k, T_k) \qquad (6.5)$$

and

$$norm(\alpha(w_{k+1}, W_k)) = \alpha(w_{k+1}, W_k) / \sum_{w \in V} \alpha(w, W_k) \qquad (6.6)$$

The sum over all words in the vocabulary $V$ ($|V| \approx 20{,}000$) prohibits the
use of the above equation in perplexity evaluations for computational reasons. In
the lattice however we have a much smaller list of future words so the summation
needs to be carried only over $\mathcal{W}_L(p)$ (see previous section) for a given path $p$. To take
care of the fact that due to the truncation of $V$ to $\mathcal{W}_L(p)$ the probability assignment
now violates the left-to-right operation of the language model we can redistribute the
3-gram mass assigned to $\mathcal{W}_L(p)$ according to the formula proposed in Eq. (6.4):

$$P_{SLMnorm}(w_{k+1}/W_k(p)) = norm(\alpha(w_{k+1}, W_k)) \cdot P_{\text{TRBNK3-gram}}(\mathcal{W}_L(p)) \qquad (6.7)$$

$$\alpha(w, W_k) = \max_{T_k \in S_k} P(w/W_k T_k) \cdot \rho(W_k, T_k) \qquad (6.8)$$

$$norm(\alpha(w_{k+1}, W_k)) = \alpha(w_{k+1}, W_k) / \sum_{w \in \mathcal{W}_L(p)} \alpha(w, W_k) \qquad (6.9)$$

$$P_{\text{TRBNK3-gram}}(\mathcal{W}_L(p)) = \sum_{w \in \mathcal{W}_L(p)} P_{\text{TRBNK3-gram}}(w/W_k(p)) \qquad (6.10)$$

Notice that if we let $\mathcal{W}_L(p) = V$ we get back Eq. (6.4). Again, one could discard
from the SLM stacks the parses which do not belong to $S_k^{pruned}$, as explained in the
previous section. Table 6.18 presents the results obtained when linearly interpolating
the above models with the 3-gram model:

$$P(l/W_k(p)) = \lambda \cdot P_{\text{TRBNK3-gram}}(l/W_k(p)) + (1 - \lambda) \cdot P_{SLMnorm}(l/W_k(p))$$
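The redistribution in Eqs. (6.7) to (6.10) can be sketched in a few lines. The representation of the parses and of the 3-gram distribution as dictionaries, the function names and the numbers are all assumptions made for illustration.

```python
def alpha(word, parses):
    """Eq. (6.8): best weighted prediction of `word` over the parses."""
    total = sum(joint for joint, _ in parses)
    return max(predictor.get(word, 0.0) * joint / total
               for joint, predictor in parses)

def slm_norm_probability(word, lattice_words, parses, trigram):
    """Eq. (6.7): normalized peeking probability of `word`.

    lattice_words -- words on the links leaving the end node of the path
    trigram       -- dict mapping w to the 3-gram probability of w given
                     the path history (hypothetical values below)
    """
    mass = sum(trigram[w] for w in lattice_words)               # Eq. (6.10)
    norm = alpha(word, parses) / sum(alpha(w, parses)
                                     for w in lattice_words)    # Eq. (6.9)
    return norm * mass

parses = [(0.03, {"a": 0.2, "b": 0.8}),
          (0.01, {"a": 0.9, "b": 0.1})]
trigram = {"a": 0.10, "b": 0.05}
lattice_words = ["a", "b"]
redistributed = {w: slm_norm_probability(w, lattice_words, parses, trigram)
                 for w in lattice_words}
```

By construction the redistributed values sum to the 3-gram mass of the lattice words, so the total probability assigned to the set is left unchanged.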

The results for the standard SLM do not match those in Table 6.10 due to the
fact that in this case we have not applied the tokenization correction specified in
Eq. (6.1), Section 6.3.1. Although some of the experiments showed improvement over
the WER baseline achieved by the 3-gram language model, none of them performed
better than the standard structured language model linearly interpolated with the
trigram model.

$\lambda$                          0.0    0.2    0.4    0.6    0.8    1.0
WER(%) (standard SLM)             42.0   41.8   41.9   41.5   42.1   42.5
WER(%) (normalized SLM)           42.7    -     42.1   42.0   42.1    -
WER(%) (pruned normalized SLM)     -      -      -     42.2    -      -

Table 6.18: Switchboard; TRBNK-3gram + Normalized Peeking SLM
Chapter 7

Conclusions and Future Directions

7.1 Comments on Using the SLM as a Parser

The structured language model could be used as a parser, namely to select the most
likely parse according to our pruning strategy: $T^* = \arg\max_T P(W,T)$. Because
the SLM allows parses in which the words in a sentence are not joined under
a single root node (see the definition of a complete parse and Figure 2.6), a direct
evaluation of the parse quality against the UPenn Treebank parses is unfair. However,
a simple modification will constrain the parses generated by the SLM to join all words
in a sentence under a single root node.

Imposing the additional constraint that:

• $P(w_k = \text{</s>} \mid W_{k-1}T_{k-1}) = 0$ if $h_{-1}.tag \neq SB$ ensures that the end of sentence
symbol </s> is generated only from a parse in which all the words have been
joined in a single constituent.

One important observation is that in this case one has to eliminate the second
pruning step in the model and the hard pruning in the caching of the CONSTRUCTOR
model actions; it is sufficient if this is done only when operating on the last
stack vector before predicting the end of sentence </s>. Otherwise, the parses that
have all the words joined under a single root node may not be present in the stacks before
the prediction of the </s> symbol, resulting in a failure to parse a given sentence.
7.2 Comparison with other Approaches

7.2.1 Underlying P(W,T) Probability Model

The actions taken by the model are very similar to those of an LR parser. However the
encoding of the word sequence along with a parse tree $(W, T)$ is different, proceeding
bottom-up and interleaving the word predictions. This leads to a different probability
assignment than that in a PCFG, which is based on a different encoding
of $(W, T)$.

A thorough comparison between the two classes of probabilistic languages,
PCFGs and shift-reduce probabilistic push-down automata, to which the SLM
pertains, has been presented in [1].

Regarding $(W, T)$ as a graph, Figure 7.1 shows the dependencies in a regular CFG;
in contrast, Figures 7.2-7.4 show the probabilistic dependencies for each model
component in the SLM; a complete dependency structure is obtained by super-imposing
the three figures. To make the SLM directly comparable with a CFG we discard the
lexical information at intermediate nodes in the tree (headword annotation), thus
assuming the following equivalence classifications in the model components (see
Eq. (2.3-2.5)):

$$P(w_k \mid W_{k-1}T_{k-1}) = P(w_k \mid [W_{k-1}T_{k-1}]) = P(w_k \mid h_0.tag, h_{-1}.tag) \qquad (7.1)$$
$$P(t_k \mid w_k, W_{k-1}T_{k-1}) = P(t_k \mid w_k, [W_{k-1}T_{k-1}]) = P(t_k \mid w_k, h_0.tag, h_{-1}.tag) \qquad (7.2)$$
$$P(p_i^k \mid W_k T_k) = P(p_i^k \mid [W_k T_k]) = P(p_i^k \mid h_0.tag, h_{-1}.tag) \qquad (7.3)$$

It can be seen that the probabilistic dependency structure is more complex than
that in a CFG even in this simplified SLM.

Along the same lines, the approach in [19] regards the word sequence $W$ with the
parse structure $T$ as a Markov graph $(W, T)$ modeled using the CFG dependencies
superimposed on the regular word-level 2-gram dependencies, showing improvement
in perplexity over both 2-gram and 3-gram modeling techniques.
Figure 7.1: CFG dependencies

Figure 7.2: Tag reduced WORD-PREDICTOR dependencies

7.2.2 Language Model

A structured approach to language modeling has been taken in [25]: the underly- 
ing probability model P{W, T) is a simple lexical link grammar, which is automatically 
induced and reestimated using EM from a training corpus containing word sequences 
(sentences). The model doesn't make use of POS/NT labels — which we found ex- 
tremely useful for word prediction and parsing. Another constraint is placed on the 
context used by the word predictor: the two words in the context used for word 
prediction are always adjacent; our model's hierarchical scheme allows the exposed
headwords to originate at any two different positions in the word prefix. Both ap- 
proaches share the desirable property that the 3-gram model belongs to the parameter 
space of the model. 
Figure 7.4: Tag reduced CONSTRUCTOR dependencies

The language model we present is closely related to the one investigated in [7]^,
however different in a few important aspects:

• our model operates in a left-to-right manner, thus allowing its use directly in
the hypothesis search for W in (1.1);

• our model is a factored version of the one in [7], thus enabling the calculation
of the joint probability of words and parse structure; this was not possible in
the previous case due to the huge computational complexity of that model;

• our model assigns probability at the word level, being a proper language model.

^The SLM might not have happened at all, were it not for the work and creative environment in
the WS96 Dependency Modeling Group and the author's desire to write a PhD thesis on structured
language modeling.
The SLM shares many features with both class based language models [23] and
skip n-gram language models [27]; an interesting approach combining class based
language models and different order skip-bigram models is presented in [28]. It seems
worthwhile to make two comments relating the SLM to these approaches:

• the smoothing involving NT/POS tags in the WORD-PREDICTOR is similar
to a class based language model using NT/POS labels for classes. We depart
however from the usual approach by not making the conditional independence
assumption $P(w_{k+1} \mid w_k, class(w_k)) = P(w_{k+1} \mid class(w_k))$. Also, in our model the
"class" assignment, through the heads exposed by a given parse $T_k$ for the
word prefix $W_k$ and its "weight" $\rho(W_k, T_k)$ (see Eq. (2.9)), is highly
context-sensitive (it depends on the entire word-prefix $W_k$) and is syntactically
motivated through the operations of the CONSTRUCTOR. A comparison
between the hh and HH equivalence classifications in the WORD-PREDICTOR
(see Table 4.5) shows the usefulness of POS/NT labels for word prediction.

• recalling the depth factorization of the model in Eq. (4.3), our model can be
viewed as a skip n-gram where the probability of a skip $P(d_0, d_1 \mid W_k)$, where $d_0, d_1$
are the depths at which the two most recent exposed headwords $h_0, h_{-1}$ can
be found (similar to $P(d \mid W_k)$), is highly context sensitive. Notice that the
hierarchical scheme for organizing the word prefix allows for contexts that do
not necessarily consist of adjacent words, as in regular skip n-gram models.

7.3 Future Directions 

We have presented an original approach to language modeling that makes use of 
syntactic structure. The experiments we have carried out show improvement in both 
perplexity and word error rate over current state-of-the-art techniques. Preliminary 
experiments reported in [30] show complementarity between the SLM and a topic 
language model yielding almost additive results — word error rate improvement — 
on the Switchboard task. Among the directions which we consider worth exploring 
in the future, are: 
• automatic induction of the SLM initial parameter values;

• better integration of the 3-gram model and the SLM;

• better parameterization of the model components;

• study of the interaction between the SLM and other language modeling techniques
such as cache and trigger or topic language models.
Appendix A

Minimizing KL Distance is
Equivalent to Maximum Likelihood

Let $f_T(Y)$ be the relative frequency probability distribution induced on $\mathcal{Y}$ by
the collection of training samples $T$; this determines the set of desired distributions
$P_T = \{p(X,Y) : p(Y) = f_T(Y)\}$. Let $Q(\Theta) = \{q_\theta(X,Y) : \theta \in \Theta\}$ be the model
space.

Proposition 2 Finding the maximum likelihood estimate $q \in Q(\Theta)$ is equivalent to
finding the pair $(p, q) \in P_T \times Q(\Theta)$ which minimizes the KL-distance $D(p \| q)$.

For a given pair $(p, q) \in P_T \times Q(\Theta)$ we have:

$$D(p \| q) = \sum_{x \in \mathcal{X}, y \in \mathcal{Y}} p(x,y) \log \frac{p(x,y)}{q(x,y)}$$
$$= \sum_{x \in \mathcal{X}, y \in \mathcal{Y}} f(y) \cdot r(x|y) \log \frac{f(y) \cdot r(x|y)}{q(y) \cdot q(x|y)}$$
$$= \sum_{y \in \mathcal{Y}} f(y) \log f(y) - \mathcal{L}(T, q) + \sum_{y \in \mathcal{Y}} f(y) \cdot D(r(x|y) \| q(x|y))$$
$$\geq \sum_{y \in \mathcal{Y}} f(y) \log f(y) - \max_{q_\theta \in Q(\Theta)} \mathcal{L}(T, q_\theta) + 0$$

The minimum value of $D(p \| q)$ is independent of $p$ and $q$ and is achieved if and
only if both:

$$q(x,y) = \arg\max_{q_\theta \in Q(\Theta)} \mathcal{L}(T, q_\theta)$$
$$r(x|y) = q(x|y)$$

are satisfied. The second condition is equivalent to $p$ being the I-projection of a given
$q$ onto $P_T$:

$$p = \arg\min_{t \in P_T} D(t \| q) = \arg\min_{r(x|y)} D(f(y) \cdot r(x|y) \| q)$$

So knowing the pair $(p, q) \in P_T \times Q(\Theta)$ that minimizes $D(p \| q)$ implies that the
maximum likelihood distribution $q \in Q(\Theta)$ has been found and reciprocally, once the
maximum likelihood distribution $q \in Q(\Theta)$ is given we can find the $p$ distribution in
$P_T$ that will minimize $D(p \| q)$, $p \in P_T$, $q \in Q(\Theta)$.

□
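The decomposition used in the proof can be checked numerically on a toy example. The sketch below assumes that the log-likelihood term is the sum over y of f(y) log q(y), which is what the decomposition requires; the distributions are arbitrary made-up values with two hidden and two observed outcomes.

```python
from math import log

def kl(p, q):
    """KL-distance between two discrete distributions given as lists."""
    return sum(pi * log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

f = [0.3, 0.7]                        # f(y), relative frequencies
r = [[0.5, 0.5], [0.2, 0.8]]          # r(x|y), one row per y
q_joint = [[0.1, 0.3], [0.2, 0.4]]    # q(x,y) indexed as [y][x], sums to 1

q_y = [sum(row) for row in q_joint]   # marginal q(y)
q_cond = [[q_joint[y][x] / q_y[y] for x in range(2)] for y in range(2)]

# Left-hand side: D(p || q) with p(x,y) = f(y) * r(x|y).
lhs = sum(f[y] * r[y][x] * log(f[y] * r[y][x] / q_joint[y][x])
          for y in range(2) for x in range(2))

# Right-hand side: the three terms of the decomposition.
entropy_term = sum(f[y] * log(f[y]) for y in range(2))
log_likelihood = sum(f[y] * log(q_y[y]) for y in range(2))
divergence_term = sum(f[y] * kl(r[y], q_cond[y]) for y in range(2))
rhs = entropy_term - log_likelihood + divergence_term
```

The last term is a non-negative average of KL-distances, which is why dropping it gives the lower bound in the final line of the derivation.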
Appendix B

Expectation Maximization as
Alternating Minimization

Let $f_T(Y)$ be the relative frequency probability distribution induced on $\mathcal{Y}$ by
the collection of training samples $T$; this determines the set of desired distributions
$P_T = \{p(X,Y) : p(Y) = f_T(Y)\}$. Let $Q(\Theta) = \{q_\theta(X,Y) : \theta \in \Theta\}$ be the model
space.

Proposition 3 One alternating minimization step between $P_T$ and $Q(\Theta)$ is equivalent
to an EM update step:

$$EM_{T,\theta_i}(\theta) = \sum_{y \in \mathcal{Y}} f_T(y) E_{q_{\theta_i}(X/Y)}[\log(q_\theta(X,Y)) \mid y], \ \theta \in \Theta \qquad (B.1)$$
$$\theta_{i+1} = \arg\max_{\theta \in \Theta} EM_{T,\theta_i}(\theta) \qquad (B.2)$$

One alternating minimization step starts from a given distribution $q_n \in Q(\Theta)$,
finds the I-projection $p_n$ of $q_n$ onto $P_T$; fixing $p_n$ we then find the I-projection $q_{n+1}$ of
$p_n$ onto $Q(\Theta)$. We will show that this leads to the EM update equation (B.2).

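Given $q_n \in Q(\Theta)$, $\forall p \in P_T$, we have:

$$D(p \| q_n) = \sum_{x \in \mathcal{X}, y \in \mathcal{Y}} p(x,y) \log \frac{p(x,y)}{q_n(x,y)}$$
$$= \sum_{x \in \mathcal{X}, y \in \mathcal{Y}} f(y) \cdot r(x|y) \log \frac{f(y) \cdot r(x|y)}{q_n(x,y)}$$
$$= \underbrace{\sum_{y \in \mathcal{Y}} f(y) \log \frac{f(y)}{q_n(y)}}_{\text{independent of } r(x|y)} + \sum_{y \in \mathcal{Y}} f(y) \cdot D(r(x/y), q_n(x/y))$$

which implies that:

$$\min_{p \in P_T} D(p \| q_n) = \sum_{y \in \mathcal{Y}} f(y) \log \frac{f(y)}{q_n(y)}$$

is achieved by $p_n = f(y) \cdot q_n(x|y)$.

Figure B.1: Alternating minimization between $P_T$ and $Q(\Theta)$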

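Now fixing $p_n$ we seek the $q \in Q(\Theta)$ which minimizes $D(p_n \| q)$:

$$D(p_n \| q) = \sum_{x \in \mathcal{X}, y \in \mathcal{Y}} p_n(x,y) \log \frac{p_n(x,y)}{q(x,y)}$$
$$= \sum_{x \in \mathcal{X}, y \in \mathcal{Y}} f(y) \cdot q_n(x|y) \log \frac{f(y) \cdot q_n(x|y)}{q(x,y)}$$
$$= \underbrace{\sum_{y \in \mathcal{Y}} f(y) \log \frac{f(y)}{q_n(y)} + \sum_{y \in \mathcal{Y}} f(y) \cdot \Big[\sum_{x \in \mathcal{X}} q_n(x|y) \log q_n(x,y)\Big]}_{\text{independent of } q(x,y)} - \sum_{x \in \mathcal{X}, y \in \mathcal{Y}} f(y) q_n(x|y) \log q(x,y)$$

But the last term can be rewritten as:

$$\sum_{x \in \mathcal{X}, y \in \mathcal{Y}} f(y) q_n(x|y) \log q(x,y) = \sum_{y \in \mathcal{Y}} f(y) \sum_{x \in \mathcal{X}} q_n(x|y) \log q(x,y) = \underbrace{\sum_{y \in \mathcal{Y}} f(y) E_{q_n(X/Y)}[\log q(X,Y) \mid y]}_{EM_{T,\theta_i}(\theta)}$$

Thus finding

$$\min_{q \in Q(\Theta)} D(p_n \| q)$$

is equivalent to finding

$$\max_{q \in Q(\Theta)} EM_{T,\theta_i}(\theta)$$

which is exactly the EM update step (B.2).

□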
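The two projections can be traced on a toy model. The sketch below assumes a small discrete family q(x,y) = pi[x] * b[x][y] with two hidden values and three observed values; this family, the function names and the numbers are illustrative assumptions and are unrelated to the SLM parameterization.

```python
from math import log

def em_step(f, pi, b):
    """One alternating minimization step for q(x, y) = pi[x] * b[x][y]."""
    n_x, n_y = len(pi), len(f)
    # I-projection onto P_T: p_n(x, y) = f(y) * q_n(x | y).
    p = [[0.0] * n_y for _ in range(n_x)]
    for y in range(n_y):
        q_y = sum(pi[x] * b[x][y] for x in range(n_x))
        for x in range(n_x):
            p[x][y] = f[y] * pi[x] * b[x][y] / q_y
    # Projection back onto the model family: maximize the EM auxiliary
    # function, which for this family is solved by relative frequencies.
    new_pi = [sum(p[x]) for x in range(n_x)]
    new_b = [[p[x][y] / new_pi[x] for y in range(n_y)] for x in range(n_x)]
    return new_pi, new_b

def log_likelihood(f, pi, b):
    """Sum over y of f(y) * log q(y)."""
    return sum(f[y] * log(sum(pi[x] * b[x][y] for x in range(len(pi))))
               for y in range(len(f)))

f = [0.5, 0.3, 0.2]
pi0 = [0.6, 0.4]
b0 = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]
before = log_likelihood(f, pi0, b0)
pi1, b1 = em_step(f, pi0, b0)
after = log_likelihood(f, pi1, b1)
```

Because this toy family is unconstrained, a single step already reproduces f(y) exactly; a constrained family would need several iterations, each of which cannot decrease the likelihood.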
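Appendix C

N-best EM convergence

In the "N-best" training paradigm we use only a subset of the conditional hidden
event space $\mathcal{X}|y$, for any given seen $y$. Associated with the model space $Q(\Theta)$ we
now have a family of strategies to sample from $\mathcal{X}|y$ a set of "N-best" hidden events
$x$, for any $y \in \mathcal{Y}$. Each sampling strategy is a function that associates a set of hidden
sequences to a given observed sequence: $s : \mathcal{Y} \rightarrow 2^{\mathcal{X}}$. The family is parameterized by
$\theta \in \Theta$:

$$S(\Theta) = \{s_\theta : \mathcal{Y} \rightarrow 2^{\mathcal{X}}, \ \forall \theta \in \Theta\} \qquad (C.1)$$

Each $\theta$ value identifies a particular sampling function.
Let:

$$q_\theta^s(X,Y) = q_\theta(X,Y) \cdot 1_{s_\theta(Y)}(X) \qquad (C.2)$$
$$q_\theta^s(X|Y) = \frac{q_\theta^s(X,Y)}{\sum_{X \in \mathcal{X}} q_\theta^s(X,Y)} \qquad (C.3)$$
$$Q(S,\Theta) = \{q_\theta^s(X,Y) : \theta \in \Theta\} \qquad (C.4)$$

Proposition 4 Assuming that $\forall \theta \in \Theta$, $Sup(q_\theta) = \mathcal{X} \times \mathcal{Y}$ ("smooth" $q_\theta(x,y)$) holds,
one alternating minimization step between $P_T$ and $Q(S,\Theta)$, $\theta_i \rightarrow \theta_{i+1}$, is equivalent to:

$$\theta_{i+1} = \arg\max_{\theta \in \Theta} \sum_{y \in \mathcal{Y}} f_T(y) E_{q_{\theta_i}^s(X|Y)}[\log(q_\theta(X,Y)) \mid y] \qquad (C.5)$$

if $\theta_{i+1}$ satisfies:

$$s_{\theta_i}(y) \subseteq s_{\theta_{i+1}}(y), \ \forall y \in T \qquad (C.6)$$

Only $\theta \in \Theta$ s.t. $s_{\theta_i}(y) \subseteq s_\theta(y), \forall y \in T$ are candidates in the M-step.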
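Proof:
E-step:

Given $q_{\theta_i}^s(x,y) \in Q(S,\Theta)$, find $p_n(x,y) = f(y) \cdot r_n(x|y) \in P(T)$ s.t.
$D(f(y) \cdot r_n(x|y) \| q_{\theta_i}^s(x,y))$ is minimized. As shown in appendix B:

$$r_n(x|y) = q_{\theta_i}^s(x|y), \ \forall y \in T \qquad (C.7)$$

Notice that for smooth $q_{\theta_i}(x|y)$ we have:

$$Sup(r_n(x|y)) = Sup(q_{\theta_i}^s(x|y)) = s_{\theta_i}(y), \ \forall y \in T \qquad (C.8)$$

M-step:

Given $p_n(x,y) = f(y) \cdot q_{\theta_i}^s(x|y)$, find $\theta_{i+1} \in \Theta$ s.t. $D(p_n \| q_{\theta_{i+1}}^s)$ is minimized.

Lemma 1 For the M-step we only need to consider candidates $\theta \in \Theta$ for which we
have

$$s_{\theta_i}(y) \subseteq s_\theta(y), \ \forall y \in T \qquad (C.9)$$

Indeed, assuming that $\exists (x_0, y_0)$ s.t. $y_0 \in T$ and $x_0 \in s_{\theta_i}(y_0)$ but $x_0 \notin s_\theta(y_0)$, we
have: $(x_0, y_0) \in Sup(f(y) \cdot r_n(x|y))$ (see (C.8)) and $(x_0, y_0) \notin Sup(q_\theta^s(x,y))$ (see (C.2))
which means that $f(y_0) \cdot r_n(x_0|y_0) > 0$ and $q_\theta^s(x_0, y_0) = 0$, rendering
$D(f(y) \cdot r_n(x|y) \| q_\theta^s(x,y)) = \infty$.

□

Following the proof in appendix B, it is easy to show that:

$$\theta^* = \arg\max_{\theta \in \Theta} \sum_{y \in \mathcal{Y}} f_T(y) E_{q_{\theta_i}^s(X|Y)}[\log(q_\theta^s(X,Y)) \mid y] \qquad (C.10)$$

minimizes $D(p_n \| q_\theta^s)$, $\forall \theta \in \Theta$.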
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Using the result in Lemma 1, only 9 E Q satisfying (C.9) are candidates for the 
M-step, so: 

r = arg max V /r(y)%(x|y)[logMX, y) • l,,(y)(X)|y)] (C.ll) 

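But notice that $Sup(q^s_{\theta_i}(x|y)) = s_{\theta_i}(y),\ \forall y \in T$ (see (C.8)) and these are the only $x$ values contributing to the conditional expectation on a given $y$; for these, however, we have $1_{s_\theta(y)}(x) = 1$ because of (C.9). This implies that (C.11) can be rewritten as:

$$\theta^* = \arg\max_{\theta \in \Theta \,\mid\, s_{\theta_i}(y) \subseteq s_\theta(y),\ \forall y \in T}\ \sum_{y \in \mathcal{Y}} f_T(y)\, E_{q^s_{\theta_i}(X|Y)}[\log(q_\theta(X,Y)) \mid y] \tag{C.12}$$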

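Because the set over which the maximization is carried out depends on $\theta_i$, the M-step is not simple. However, we notice that if the maximum on the entire space $\Theta$:

$$\theta_{i+1} = \arg\max_{\theta \in \Theta} \sum_{y \in \mathcal{Y}} f_T(y)\, E_{q^s_{\theta_i}(X|Y)}[\log(q_\theta(X,Y)) \mid y] \tag{C.13}$$

satisfies $s_{\theta_i}(y) \subseteq s_{\theta_{i+1}}(y),\ \forall y \in T$, then $\theta_{i+1}$ is the correct update $\theta^*$.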

□ 
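The condition that makes the unconstrained update valid is a set inclusion that can be checked directly once both families of sampled sets are available. The following is a minimal sketch of that check; the function name `sample_sets_nested` and the dictionary representation of the sampling functions are assumptions of this example, not part of the original text.

```python
from typing import Dict, Hashable, Iterable, Set


def sample_sets_nested(
    old: Dict[Hashable, Set[Hashable]],
    new: Dict[Hashable, Set[Hashable]],
    observed: Iterable[Hashable],
) -> bool:
    """Return True when the old sampled set is contained in the new one for every observed y."""
    return all(old.get(y, set()) <= new.get(y, set()) for y in observed)


# Hypothetical sampled sets for two observations.
s_old = {"y1": {"a"}, "y2": {"b", "c"}}
s_new = {"y1": {"a", "b"}, "y2": {"b", "c"}}
update_is_valid = sample_sets_nested(s_old, s_new, ["y1", "y2"])
```

If the check fails for some observed $y$, the maximizer over the whole parameter space is not guaranteed to be the constrained M-step solution.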
Appendix D



Structured Language Model 
Parameter Reestimation 



The probability of a $(W,T)$ sequence is obtained by chaining the probabilities of the elementary events in its derivation, as described in section 2.3:

$$P(W,T) = P(d(W,T)) = \prod_{i=1}^{length(d(W,T))} P(e_i)$$
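The chaining above is usually evaluated in the log domain to avoid underflow on long derivations. A minimal sketch follows, assuming each elementary event is a hashable key into a table of event probabilities; the tuple layout and the toy probabilities are hypothetical and chosen only for illustration.

```python
import math
from typing import Dict, Hashable, Sequence


def derivation_log_prob(events: Sequence[Hashable], event_prob: Dict[Hashable, float]) -> float:
    """Sum the log probabilities of the elementary events of one derivation."""
    return sum(math.log(event_prob[e]) for e in events)


# Hypothetical two-event derivation; keys are (component, predicted item, context).
toy_probs = {
    ("WORD-PREDICTOR", "the", "<s>"): 0.5,
    ("TAGGER", "DT", "the"): 0.25,
}
toy_derivation = list(toy_probs)
toy_log_prob = derivation_log_prob(toy_derivation, toy_probs)
```

Exponentiating `toy_log_prob` recovers the product of the two event probabilities.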

The E-step is carried out by sampling the space of hidden events for a given seen sequence $W$ according to the pruning strategy outlined in section 2.5.



The logarithm of the probability of a given derivation can be calculated as follows:

$$\begin{aligned}
\log P_\theta(W,T) &= \sum_{i=1}^{length(d(W,T))} \log P_\theta(e_i) \\
&= \sum_m \sum_{i=1}^{length(d(W,T))} \sum_{(u^{(m)}, z^{(m)})} \log P_\theta(u^{(m)}, z^{(m)}) \cdot \delta(e_i, (u^{(m)}, z^{(m)})) \\
&= \sum_m \sum_{(u^{(m)}, z^{(m)})} \#[(u^{(m)}, z^{(m)}) \in d(W,T)] \cdot \log P_\theta(u^{(m)}, z^{(m)})
\end{aligned}$$

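where the random variable $\#[(u^{(m)}, z^{(m)}) \in d(W,T)]$ denotes the number of occurrences of the $(u^{(m)}, z^{(m)})$ event in the derivation of $W, T$.
Let

$$a_{\theta_i}((u^{(m)}, z^{(m)})) \doteq \sum_{W \in \mathcal{T}} f(W) \cdot E_{P^s_{\theta_i}(T|W)}\big[\#[(u^{(m)}, z^{(m)}) \in d(W,T)]\big] \tag{D.1}$$

We then have:

$$E_{P^s_{\theta_i}(T|W)}[\log P_\theta(W,T)] = \sum_m \sum_{(u^{(m)}, z^{(m)})} E_{P^s_{\theta_i}(T|W)}\big[\#[(u^{(m)}, z^{(m)}) \in d(W,T)]\big] \cdot \log P_\theta(u^{(m)}, z^{(m)})$$

and

$$\sum_{W \in \mathcal{T}} f(W) \cdot E_{P^s_{\theta_i}(T|W)}[\log P_\theta(W,T)] \tag{D.2}$$

$$= \sum_m \sum_{(u^{(m)}, z^{(m)})} a_{\theta_i}((u^{(m)}, z^{(m)})) \cdot \log P_\theta(u^{(m)}, z^{(m)}) \tag{D.3}$$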

The E-step thus consists of the calculation of the expected values $a_{\theta_i}((u^{(m)}, z^{(m)}))$, for every model component $m$ and every event $(u^{(m)}, z^{(m)})$ in the derivations that survived the pruning process.
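The expected counts can be accumulated by weighting the event counts of each surviving derivation by its posterior probability among the survivors. The sketch below is one way to do this under stated assumptions: each sentence comes with a weight and a list of surviving derivations given as (log probability, event list) pairs, and the posterior is obtained by renormalizing over the survivors only. The function name `expected_event_counts` and the data layout are hypothetical.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, Hashable, List, Sequence, Tuple

Derivation = Tuple[float, Sequence[Hashable]]  # (log P(W, T), elementary events)


def expected_event_counts(corpus: List[Tuple[float, List[Derivation]]]) -> Dict[Hashable, float]:
    """Accumulate weight * posterior * count for every event over the surviving derivations."""
    totals: Dict[Hashable, float] = defaultdict(float)
    for weight, derivations in corpus:
        if not derivations:
            continue  # nothing survived pruning for this sentence
        top = max(lp for lp, _ in derivations)  # shift for numerical stability
        norm = sum(math.exp(lp - top) for lp, _ in derivations)
        for lp, events in derivations:
            posterior = math.exp(lp - top) / norm
            for event, n in Counter(events).items():
                totals[event] += weight * posterior * n
    return dict(totals)


# One sentence with weight 1.0 and two surviving derivations (hypothetical numbers).
toy_corpus = [(1.0, [(math.log(0.3), ["e1", "e1", "e2"]), (math.log(0.1), ["e1"])])]
toy_counts = expected_event_counts(toy_corpus)
```

In the toy case the two derivations receive posteriors 0.75 and 0.25, so event `e1` accumulates 0.75 * 2 + 0.25 * 1 and event `e2` accumulates 0.75.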

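In the M-step we need to find a new parameter value $\theta_{i+1}$ such that we maximize the EM auxiliary function (D.2):

$$\theta_{i+1} = \arg\max_\theta \sum_{W \in \mathcal{T}} f(W) \cdot E_{P^s_{\theta_i}(T|W)}[\log P_\theta(W,T)] \tag{D.4}$$

$$= \arg\max_\theta \sum_m \sum_{(u^{(m)}, z^{(m)})} a_{\theta_i}((u^{(m)}, z^{(m)})) \cdot \log P_\theta(u^{(m)}, z^{(m)}) \tag{D.5}$$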
The parameters $\theta$ are the maximal order joint counts $C^{(m)}(u^{(m)}, z^{(m)})$ for each model component $m \in \{$WORD-PREDICTOR, TAGGER, PARSER$\}$.
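For intuition only, consider a component whose probabilities are free parameters normalized over $u$ for each context $z$. This is a simplifying assumption of the sketch; the parameterization of Section 2.4 is not modeled here. Under that assumption, maximizing a count-weighted sum of log probabilities such as (D.5) yields relative frequencies of the expected counts. The key layout `(m, u, z)` and the function name are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Hashable, Tuple

Event = Tuple[str, Hashable, Hashable]  # (component m, predicted item u, context z)


def relative_frequency(counts: Dict[Event, float]) -> Dict[Event, float]:
    """Normalize expected counts over u within each (component, context) pair."""
    context_totals: Dict[Tuple[str, Hashable], float] = defaultdict(float)
    for (m, u, z), c in counts.items():
        context_totals[(m, z)] += c
    return {
        (m, u, z): c / context_totals[(m, z)]
        for (m, u, z), c in counts.items()
        if context_totals[(m, z)] > 0.0
    }


# Hypothetical expected counts for one context of one component.
toy_expected = {("TAGGER", "DT", "the"): 3.0, ("TAGGER", "NN", "the"): 1.0}
toy_estimate = relative_frequency(toy_expected)
```

The estimates within each context sum to one, which is the normalization constraint assumed by this sketch.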

One can easily notice that the M-step is in fact a problem of maximum likelihood estimation for each model component $m$ from joint counts $a_{\theta_i}((u^{(m)}, z^{(m)}))$. Taking into account the parameterization of $P_\theta(u^{(m)}, z^{(m)})$ (see Section 2.4), the problem can be seen as an HMM reestimation problem. The EM algorithm can be employed to solve it. Convergence takes place in exactly one EM iteration to: